Introduction

To compare several types of regression models, I use the Sleep Health and Lifestyle dataset. The regression techniques I will be using are listed below. To guard against over-fitting, I split the data into a training set, a validation set, and a test set. I will not only explain the results of each model but also compare the results across models. Finally, I will discuss some open problems for future studies.

Regression Techniques

1. Parametric Regression Models

a. Simple Linear Regression

b. Polynomial Linear Regression

2. Nonparametric Regression Models

a. Kernel Regression

b. Regression Trees

c. Locally Weighted Regression

Check Type & Structure of Data

data = read.csv(file = 'Sleep_health_and_lifestyle_dataset.csv', header = TRUE, sep = ',')
class(data)
## [1] "data.frame"
str(data)
## 'data.frame':    374 obs. of  13 variables:
##  $ Person.ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                 : chr  "Male" "Male" "Male" "Male" ...
##  $ Age                    : int  27 28 28 28 28 28 29 29 29 29 ...
##  $ Occupation             : chr  "Software Engineer" "Doctor" "Doctor" "Sales Representative" ...
##  $ Sleep.Duration         : num  6.1 6.2 6.2 5.9 5.9 5.9 6.3 7.8 7.8 7.8 ...
##  $ Quality.of.Sleep       : int  6 6 6 4 4 4 6 7 7 7 ...
##  $ Physical.Activity.Level: int  42 60 60 30 30 30 40 75 75 75 ...
##  $ Stress.Level           : int  6 8 8 8 8 8 7 6 6 6 ...
##  $ BMI.Category           : chr  "Overweight" "Normal" "Normal" "Obese" ...
##  $ Blood.Pressure         : chr  "126/83" "125/80" "125/80" "140/90" ...
##  $ Heart.Rate             : int  77 75 75 85 85 85 82 70 70 70 ...
##  $ Daily.Steps            : int  4200 10000 10000 3000 3000 3000 3500 8000 8000 8000 ...
##  $ Sleep.Disorder         : chr  "None" "None" "None" "Sleep Apnea" ...
cat("Number of rows with missing values:", sum(!complete.cases(data)), "\n")
## Number of rows with missing values: 0
missing_counts <- colSums(is.na(data))
columns_with_missing <- sum(missing_counts > 0)
cat("Number of columns with missing values:", columns_with_missing, "\n")
## Number of columns with missing values: 0
if (columns_with_missing > 0) {
  cat("Columns with missing values:", paste(names(missing_counts[missing_counts > 0]), collapse = ", "), "\n")
}

Split into Sets

I will split the data into 70% training, 15% validation, and 15% test sets.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Note: no random seed is set in the original run, so the exact partition
# (and the counts below) will vary between runs; set.seed() would fix it.
trainIndex <- createDataPartition(data$Quality.of.Sleep, p = 0.7, 
                                  list = FALSE)
train <- data[trainIndex, ]
holdout <- data[-trainIndex, ]   # remaining 30%, split in half below

validIndex <- createDataPartition(holdout$Quality.of.Sleep, p = 0.5, 
                                  list = FALSE)
valid <- holdout[validIndex, ]
test <- holdout[-validIndex, ]

n_train <- nrow(train)
n_valid <- nrow(valid)
n_test <- nrow(test)

cat("Number of data points in the training set:", n_train, "\n")
## Number of data points in the training set: 263
cat("Number of data points in the validation set:", n_valid, "\n")
## Number of data points in the validation set: 57
cat("Number of data points in the test set:", n_test, "\n")
## Number of data points in the test set: 54

Summary of Correlation

library(corrplot)
## corrplot 0.92 loaded
num_col <- sapply(train, function(x) is.numeric(x))
num <- train[, num_col]

correlation_matrix <- cor(num, use = "complete.obs")

corrplot(correlation_matrix,
         method = "color",  
         type = "upper", 
         tl.cex = 0.7,   
         tl.col = "black" 
)

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(num, 
             method = "pearson", 
             hist.col = "lightblue",
             density = TRUE,
             ellipses = TRUE, 
             main = "Correlation Plot with Histograms and Scatter Plots",
             gap = 0
)

1. Parametric Regression Models

a. Simple Linear Regression

Quality.of.Sleep vs. Age

model <- lm(Age ~ Quality.of.Sleep, data = train)
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Age ~ Quality.of.Sleep, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.569  -6.569  -1.569   6.184  13.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        16.514      2.958   5.582 5.94e-08 ***
## Quality.of.Sleep    3.507      0.399   8.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.666 on 261 degrees of freedom
## Multiple R-squared:  0.2284, Adjusted R-squared:  0.2255 
## F-statistic: 77.27 on 1 and 261 DF,  p-value: < 2.2e-16
coefficients <- model_summary$coefficients
intercept <- coefficients["(Intercept)", "Estimate"]
slope <- coefficients["Quality.of.Sleep", "Estimate"]

cat("Simple Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(slope, 2), "x\n")
## Simple Linear Regression Model Equation: y =  16.51  +  3.51 x
residuals <- residuals(model)
SSE <- sum(residuals^2)
TSS <- sum((train$Age - mean(train$Age))^2)
n <- nrow(train)
RSE <- sqrt(SSE / (n - 2))

cat("Residual Standard Error (RSE):", RSE, "\n")
## Residual Standard Error (RSE): 7.665638
cat("Total Sum of Squares (TSS):", TSS, "\n")
## Total Sum of Squares (TSS): 19877.24
coefficient_estimate <- coef(model)["Quality.of.Sleep"]
standard_error <- summary(model)$coef["Quality.of.Sleep", "Std. Error"]

t_statistic <- coefficient_estimate / standard_error

df <- nrow(train) - 2

alpha <- 0.05

p_value <- 2 * (1 - pt(abs(t_statistic), df))

if (p_value < alpha) {
  cat("Reject the null hypothesis. The coefficient is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The coefficient is not statistically significant.\n")
}
## Reject the null hypothesis. The coefficient is statistically significant.
plot(train$Quality.of.Sleep, train$Age, main = "Simple Linear Regression (Training Set)", xlab = "Quality of Sleep", ylab = "Age")
abline(model, col = "red")

Test against Validation Set

predictors <- attr(model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_age <- predict(model, newdata = valid)
mse <- mean((valid$Age - predicted_age)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Age - predicted_age)^2) / sum((valid$Age - mean(valid$Age))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_model <- sum(model$residuals^2)
n_model <- length(model$residuals)
p_model <- number_of_predictors
Cp_model <- (SSE_model / mse) - (n_model - 2 * p_model)
AIC_model <- n_model * log(SSE_model / n_model) + 2 * (p_model + 1)
BIC_model <- n_model * log(SSE_model / n_model) + (p_model + 1) * log(n_model)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 59.1125
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 7.688466
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.2215862
cat("Number of Predictors in the Model:", p_model, "\n")
## Number of Predictors in the Model: 1
cat("Mallow's Cp for the Model:", Cp_model, "\n")
## Mallow's Cp for the Model: -1.547553
cat("AIC for the Model:", AIC_model, "\n")
## AIC for the Model: 1073.322
cat("BIC for the Model:", BIC_model, "\n")
## BIC for the Model: 1080.466
plot(valid$Quality.of.Sleep, valid$Age, main = "Simple Linear Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Age")
abline(model, col = "blue")
points(valid$Quality.of.Sleep, predicted_age, col = "red", pch = 20)

Test against Test Set

predicted_age_test <- predict(model, newdata = test)
mse_test <- mean((test$Age - predicted_age_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- length(predictors)

r_squared_test <- 1 - (sum((test$Age - predicted_age_test)^2) / sum((test$Age - mean(test$Age))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)

SSE_model_test <- sum((test$Age - predicted_age_test)^2)
n_model_test <- n_test
p_model_test <- k_test
# Caveat: SSE_model_test / mse_test equals n_model_test by construction here,
# so this Cp always reduces to 2 * p_model_test and is uninformative.
Cp_model_test <- (SSE_model_test / mse_test) - (n_model_test - 2 * p_model_test)
AIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + 2 * (p_model_test + 1)
BIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + (p_model_test + 1) * log(n_model_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 56.63359
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 7.525529
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.1689779
cat("Number of Predictors in the Model on the test set:", p_model_test, "\n")
## Number of Predictors in the Model on the test set: 1
cat("Mallow's Cp for the Model on the test set:", Cp_model_test, "\n")
## Mallow's Cp for the Model on the test set: 2
cat("AIC for the Model on the test set:", AIC_model_test, "\n")
## AIC for the Model on the test set: 221.9765
cat("BIC for the Model on the test set:", BIC_model_test, "\n")
## BIC for the Model on the test set: 225.9545
plot(test$Quality.of.Sleep, test$Age, main = "Simple Linear Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Age")
abline(model, col = "blue")
points(test$Quality.of.Sleep, predicted_age_test, col = "red", pch = 20)

From the above plots, we can see that quality of sleep and age have a positive correlation. The RSE, RMSE, and adjusted R-squared values for the training, validation, and test sets are broadly consistent, which suggests stable performance, but the errors are large (RSE and RMSE of roughly 7.5 years), and the low adjusted R-squared on the validation and test sets indicates that the model explains only a limited share of the variance in Age.

Improvements to consider:

The model may benefit from additional predictor variables or a more complex model to better explain the variance in Age.
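One way to act on this suggestion is a multiple linear regression that adds further predictors. The sketch below is hypothetical and untuned: the predictor choices (Stress.Level, Sleep.Duration) are my assumptions about what might help, and a small synthetic frame mimicking the training data is used so the example runs standalone.

```r
# Synthetic stand-in for the training set (assumption: similar ranges).
set.seed(1)
df <- data.frame(
  Age = sample(27:59, 200, replace = TRUE),
  Quality.of.Sleep = sample(4:9, 200, replace = TRUE),
  Stress.Level = sample(3:8, 200, replace = TRUE),
  Sleep.Duration = runif(200, 5.8, 8.5)
)

# Hypothetical extension of the simple model with two extra predictors.
multi_model <- lm(Age ~ Quality.of.Sleep + Stress.Level + Sleep.Duration,
                  data = df)
summary(multi_model)$adj.r.squared  # compare against the simple model's value
```

On the real data, the adjusted R-squared of this richer model could be compared directly with the 0.2255 reported above.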

Quality.of.Sleep vs. Sleep.Duration

model <- lm(Sleep.Duration ~ Quality.of.Sleep, data = train)
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Sleep.Duration ~ Quality.of.Sleep, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4384 -0.3229 -0.1075  0.2771  0.9616 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.84668    0.14627   19.46   <2e-16 ***
## Quality.of.Sleep  0.58453    0.01973   29.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.379 on 261 degrees of freedom
## Multiple R-squared:  0.7708, Adjusted R-squared:   0.77 
## F-statistic:   878 on 1 and 261 DF,  p-value: < 2.2e-16
coefficients <- model_summary$coefficients

intercept <- coefficients["(Intercept)", "Estimate"]
slope <- coefficients["Quality.of.Sleep", "Estimate"]

cat("Simple Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(slope, 2), "x\n")
## Simple Linear Regression Model Equation: y =  2.85  +  0.58 x
residuals <- residuals(model)

SSE <- sum(residuals^2)

TSS <- sum((train$Sleep.Duration - mean(train$Sleep.Duration))^2)

n <- nrow(train)
RSE <- sqrt(SSE / (n - 2))

cat("Residual Standard Error (RSE):", RSE, "\n")
## Residual Standard Error (RSE): 0.3790349
cat("Total Sum of Squares (TSS):", TSS, "\n")
## Total Sum of Squares (TSS): 163.6344
coefficient_estimate <- coef(model)["Quality.of.Sleep"]
standard_error <- summary(model)$coef["Quality.of.Sleep", "Std. Error"]

t_statistic <- coefficient_estimate / standard_error

df <- nrow(train) - 2

alpha <- 0.05

p_value <- 2 * (1 - pt(abs(t_statistic), df))

if (p_value < alpha) {
  cat("Reject the null hypothesis. The coefficient is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The coefficient is not statistically significant.\n")
}
## Reject the null hypothesis. The coefficient is statistically significant.
plot(train$Quality.of.Sleep, train$Sleep.Duration, main = "Simple Linear Regression (Training Set)", xlab = "Quality of Sleep", ylab = "Sleep Duration")

abline(model, col = "red")

Test against Validation Set

predictors <- attr(model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_age <- predict(model, newdata = valid)
mse <- mean((valid$Sleep.Duration - predicted_age)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Sleep.Duration - predicted_age)^2) / sum((valid$Sleep.Duration - mean(valid$Sleep.Duration))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_model <- sum(model$residuals^2)
n_model <- length(model$residuals)
p_model <- number_of_predictors
Cp_model <- (SSE_model / mse) - (n_model - 2 * p_model)
AIC_model <- n_model * log(SSE_model / n_model) + 2 * (p_model + 1)
BIC_model <- n_model * log(SSE_model / n_model) + (p_model + 1) * log(n_model)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 0.1295848
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.3599789
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.7972838
cat("Number of Predictors in the Model:", p_model, "\n")
## Number of Predictors in the Model: 1
cat("Mallow's Cp for the Model:", Cp_model, "\n")
## Mallow's Cp for the Model: 28.36433
cat("AIC for the Model:", AIC_model, "\n")
## AIC for the Model: -508.2944
cat("BIC for the Model:", BIC_model, "\n")
## BIC for the Model: -501.1501
plot(valid$Quality.of.Sleep, valid$Sleep.Duration, main = "Simple Linear Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Sleep Duration")
abline(model, col = "blue")
points(valid$Quality.of.Sleep, predicted_age, col = "red", pch = 20)

Test against Test Set

predicted_test <- predict(model, newdata = test)
mse_test <- mean((test$Sleep.Duration - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- length(predictors)

r_squared_test <- 1 - (sum((test$Sleep.Duration - predicted_test)^2) / sum((test$Sleep.Duration - mean(test$Sleep.Duration))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)

SSE_model_test <- sum((test$Sleep.Duration - predicted_test)^2)
n_model_test <- n_test
p_model_test <- k_test
Cp_model_test <- (SSE_model_test / mse_test) - (n_model_test - 2 * p_model_test)
AIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + 2 * (p_model_test + 1)
BIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + (p_model_test + 1) * log(n_model_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 0.1314462
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 0.3625551
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.7950755
cat("Number of Predictors in the Model on the test set:", p_model_test, "\n")
## Number of Predictors in the Model on the test set: 1
cat("Mallow's Cp for the Model on the test set:", Cp_model_test, "\n")
## Mallow's Cp for the Model on the test set: 2
cat("AIC for the Model on the test set:", AIC_model_test, "\n")
## AIC for the Model on the test set: -105.5745
cat("BIC for the Model on the test set:", BIC_model_test, "\n")
## BIC for the Model on the test set: -101.5966
plot(test$Quality.of.Sleep, test$Sleep.Duration, main = "Simple Linear Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Sleep.Duration")
abline(model, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

Overall, the model appears to perform well, as indicated by the low MSE and RMSE values and the high adjusted R-squared values, and the slope passes the statistical significance test on the training set. The test-set metrics are consistent with the validation set, indicating that the model’s performance is stable and generalizes well to unseen data. The positive slope and fitted line indicate a positive correlation between sleep quality and sleep duration.
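Beyond point predictions, per-observation uncertainty can be reported with prediction intervals, which complement the RMSE figures above. This is a sketch of the standard `predict.lm` interface, run on R's built-in cars data (not the sleep dataset) so it is self-contained.

```r
# Fit a simple linear model on built-in data for a standalone illustration.
fit <- lm(dist ~ speed, data = cars)

# A 95% prediction interval for a new observation at speed = 15.
pred_int <- predict(fit, newdata = data.frame(speed = 15),
                    interval = "prediction")
pred_int  # columns: fit (point prediction), lwr, upr
```

On the sleep data, the same call with `newdata = valid` would give an interval around each predicted sleep duration.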

Quality.of.Sleep vs. Daily.Steps

model <- lm(Daily.Steps ~ Quality.of.Sleep, data = train)
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Daily.Steps ~ Quality.of.Sleep, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3617.8 -1230.8    99.7  1156.2  3269.2 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6391.88     640.18   9.985   <2e-16 ***
## Quality.of.Sleep    56.49      86.34   0.654    0.514    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1659 on 261 degrees of freedom
## Multiple R-squared:  0.001637,   Adjusted R-squared:  -0.002188 
## F-statistic: 0.428 on 1 and 261 DF,  p-value: 0.5135
coefficients <- model_summary$coefficients

intercept <- coefficients["(Intercept)", "Estimate"]
slope <- coefficients["Quality.of.Sleep", "Estimate"]

cat("Simple Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(slope, 2), "x\n")
## Simple Linear Regression Model Equation: y =  6391.88  +  56.49 x
residuals <- residuals(model)

SSE <- sum(residuals^2)

TSS <- sum((train$Daily.Steps - mean(train$Daily.Steps))^2)

n <- nrow(train)
RSE <- sqrt(SSE / (n - 2))

cat("Residual Standard Error (RSE):", RSE, "\n")
## Residual Standard Error (RSE): 1658.918
cat("Total Sum of Squares (TSS):", TSS, "\n")
## Total Sum of Squares (TSS): 719452548
coefficient_estimate <- coef(model)["Quality.of.Sleep"]
standard_error <- summary(model)$coef["Quality.of.Sleep", "Std. Error"]

t_statistic <- coefficient_estimate / standard_error

df <- nrow(train) - 2

alpha <- 0.05

p_value <- 2 * (1 - pt(abs(t_statistic), df))

if (p_value < alpha) {
  cat("Reject the null hypothesis. The coefficient is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The coefficient is not statistically significant.\n")
}
## Fail to reject the null hypothesis. The coefficient is not statistically significant.
plot(train$Quality.of.Sleep, train$Daily.Steps, main = "Simple Linear Regression (Training Set)", xlab = "Quality of Sleep", ylab = "Daily Steps")
abline(model, col = "red")

Test against Validation Set

predictors <- attr(model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_age <- predict(model, newdata = valid)
mse <- mean((valid$Daily.Steps - predicted_age)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Daily.Steps - predicted_age)^2) / sum((valid$Daily.Steps - mean(valid$Daily.Steps))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_model <- sum(model$residuals^2)
n_model <- length(model$residuals)
p_model <- number_of_predictors
Cp_model <- (SSE_model / mse) - (n_model - 2 * p_model)
AIC_model <- n_model * log(SSE_model / n_model) + 2 * (p_model + 1)
BIC_model <- n_model * log(SSE_model / n_model) + (p_model + 1) * log(n_model)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 1987807
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 1409.896
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): -0.03501304
cat("Number of Predictors in the Model:", p_model, "\n")
## Number of Predictors in the Model: 1
cat("Mallow's Cp for the Model:", Cp_model, "\n")
## Mallow's Cp for the Model: 100.3402
cat("AIC for the Model:", AIC_model, "\n")
## AIC for the Model: 3901.715
cat("BIC for the Model:", BIC_model, "\n")
## BIC for the Model: 3908.859
plot(valid$Quality.of.Sleep, valid$Daily.Steps, main = "Simple Linear Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Daily Steps")
abline(model, col = "blue")
points(valid$Quality.of.Sleep, predicted_age, col = "red", pch = 20)

Test against Test Set

predicted_test <- predict(model, newdata = test)
mse_test <- mean((test$Daily.Steps - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- length(predictors)

r_squared_test <- 1 - (sum((test$Daily.Steps - predicted_test)^2) / sum((test$Daily.Steps - mean(test$Daily.Steps))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)

SSE_model_test <- sum((test$Daily.Steps - predicted_test)^2)
n_model_test <- n_test
p_model_test <- k_test
Cp_model_test <- (SSE_model_test / mse_test) - (n_model_test - 2 * p_model_test)
AIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + 2 * (p_model_test + 1)
BIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + (p_model_test + 1) * log(n_model_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 2688739
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 1639.738
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: -0.02255441
cat("Number of Predictors in the Model on the test set:", p_model_test, "\n")
## Number of Predictors in the Model on the test set: 1
cat("Mallow's Cp for the Model on the test set:", Cp_model_test, "\n")
## Mallow's Cp for the Model on the test set: 2
cat("AIC for the Model on the test set:", AIC_model_test, "\n")
## AIC for the Model on the test set: 803.4475
cat("BIC for the Model on the test set:", BIC_model_test, "\n")
## BIC for the Model on the test set: 807.4254
plot(test$Quality.of.Sleep, test$Daily.Steps, main = "Simple Linear Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Daily.Steps")
abline(model, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

The evaluation metrics for both the validation and test sets suggest that the model relating sleep quality to daily steps fits the data poorly and has high prediction errors. The negative adjusted R-squared values indicate that the model explains the response worse than simply predicting its mean. The high RMSE, AIC, and BIC values on the validation and test sets further support this conclusion. Although the fitted line has a positive slope, the coefficient is not statistically significant, so no reliable association can be claimed.
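For a simple linear regression, the t-test on the slope is equivalent to a test of the Pearson correlation, so `cor.test` offers an independent check on the "not significant" conclusion above. The sketch below uses R's built-in cars data so it runs standalone; on the sleep data the same pattern would apply to `Quality.of.Sleep` and `Daily.Steps`.

```r
# Slope t-test from the regression summary.
fit <- lm(dist ~ speed, data = cars)
slope_p <- summary(fit)$coefficients["speed", "Pr(>|t|)"]

# Equivalent test of the Pearson correlation.
cor_p <- cor.test(cars$speed, cars$dist)$p.value

all.equal(slope_p, cor_p)  # the two p-values agree
```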

Quality.of.Sleep vs. Physical.Activity.Level

model <- lm(Physical.Activity.Level ~ Quality.of.Sleep, data = train)
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = Physical.Activity.Level ~ Quality.of.Sleep, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.294 -13.080  -1.187  16.920  35.027 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        36.333      7.814   4.650 5.28e-06 ***
## Quality.of.Sleep    3.107      1.054   2.948  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.25 on 261 degrees of freedom
## Multiple R-squared:  0.03223,    Adjusted R-squared:  0.02852 
## F-statistic: 8.691 on 1 and 261 DF,  p-value: 0.003487
coefficients <- model_summary$coefficients

intercept <- coefficients["(Intercept)", "Estimate"]
slope <- coefficients["Quality.of.Sleep", "Estimate"]

cat("Simple Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(slope, 2), "x\n")
## Simple Linear Regression Model Equation: y =  36.33  +  3.11 x
residuals <- residuals(model)

SSE <- sum(residuals^2)

TSS <- sum((train$Physical.Activity.Level - mean(train$Physical.Activity.Level))^2)

n <- nrow(train)
RSE <- sqrt(SSE / (n - 2))

cat("Residual Standard Error (RSE):", RSE, "\n")
## Residual Standard Error (RSE): 20.24773
cat("Total Sum of Squares (TSS):", TSS, "\n")
## Total Sum of Squares (TSS): 110565.6
coefficient_estimate <- coef(model)["Quality.of.Sleep"]
standard_error <- summary(model)$coef["Quality.of.Sleep", "Std. Error"]

t_statistic <- coefficient_estimate / standard_error

df <- nrow(train) - 2

alpha <- 0.05

p_value <- 2 * (1 - pt(abs(t_statistic), df))

if (p_value < alpha) {
  cat("Reject the null hypothesis. The coefficient is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The coefficient is not statistically significant.\n")
}
## Reject the null hypothesis. The coefficient is statistically significant.
plot(train$Quality.of.Sleep, train$Physical.Activity.Level, main = "Simple Linear Regression (Training Set)", xlab = "Quality of Sleep", ylab = "Physical Activity Level")

abline(model, col = "red")

Test against Validation Set

predictors <- attr(model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_age <- predict(model, newdata = valid)
mse <- mean((valid$Physical.Activity.Level - predicted_age)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Physical.Activity.Level - predicted_age)^2) / sum((valid$Physical.Activity.Level - mean(valid$Physical.Activity.Level))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_model <- sum(model$residuals^2)
n_model <- length(model$residuals)
p_model <- number_of_predictors
Cp_model <- (SSE_model / mse) - (n_model - 2 * p_model)
AIC_model <- n_model * log(SSE_model / n_model) + 2 * (p_model + 1)
BIC_model <- n_model * log(SSE_model / n_model) + (p_model + 1) * log(n_model)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 437.0631
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 20.90605
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.021927
cat("Number of Predictors in the Model:", p_model, "\n")
## Number of Predictors in the Model: 1
cat("Mallow's Cp for the Model:", Cp_model, "\n")
## Mallow's Cp for the Model: -16.17865
cat("AIC for the Model:", AIC_model, "\n")
## AIC for the Model: 1584.223
cat("BIC for the Model:", BIC_model, "\n")
## BIC for the Model: 1591.367
plot(valid$Quality.of.Sleep, valid$Physical.Activity.Level, main = "Simple Linear Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Physical Activity Level")
abline(model, col = "blue")
points(valid$Quality.of.Sleep, predicted_age, col = "red", pch = 20)

Test against Test Set

predicted_test <- predict(model, newdata = test)
mse_test <- mean((test$Physical.Activity.Level - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- length(predictors)

r_squared_test <- 1 - (sum((test$Physical.Activity.Level - predicted_test)^2) / sum((test$Physical.Activity.Level - mean(test$Physical.Activity.Level))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)

SSE_model_test <- sum((test$Physical.Activity.Level - predicted_test)^2)
n_model_test <- n_test
p_model_test <- k_test
Cp_model_test <- (SSE_model_test / mse_test) - (n_model_test - 2 * p_model_test)
AIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + 2 * (p_model_test + 1)
BIC_model_test <- n_model_test * log(SSE_model_test / n_model_test) + (p_model_test + 1) * log(n_model_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 443.6013
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 21.06184
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.03665047
cat("Number of Predictors in the Model on the test set:", p_model_test, "\n")
## Number of Predictors in the Model on the test set: 1
cat("Mallow's Cp for the Model on the test set:", Cp_model_test, "\n")
## Mallow's Cp for the Model on the test set: 2
cat("AIC for the Model on the test set:", AIC_model_test, "\n")
## AIC for the Model on the test set: 333.126
cat("BIC for the Model on the test set:", BIC_model_test, "\n")
## BIC for the Model on the test set: 337.104
plot(test$Quality.of.Sleep, test$Physical.Activity.Level, main = "Simple Linear Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Physical.Activity.Level")
abline(model, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

This simple linear regression model has a low adjusted R-squared on the training and validation sets, suggesting that it explains only a small portion of the variance in physical activity level. While the coefficient is statistically significant and indicates a positive association, the model still has relatively high prediction errors, as indicated by the RSE, MSE, and RMSE values. The adjusted R-squared on the test set is also close to zero, so the model adds little over predicting the mean on unseen data. Overall, this model does not provide a good fit to the data.
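The MSE/RMSE/adjusted-R-squared block is repeated verbatim for every model above; a small helper would reduce that duplication. This is a sketch, not code from the original analysis: the function and argument names are my own, and the demonstration uses R's built-in mtcars data so it runs standalone.

```r
# Hypothetical helper collecting the evaluation metrics computed repeatedly
# above: MSE, RMSE, and adjusted R-squared.
eval_metrics <- function(actual, predicted, n_predictors) {
  n <- length(actual)
  mse <- mean((actual - predicted)^2)
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
  list(mse = mse, rmse = sqrt(mse), adj_r2 = adj_r2)
}

# Standalone check on built-in data; on the sleep data one would call e.g.
# eval_metrics(valid$Age, predict(model, newdata = valid), 1).
fit <- lm(mpg ~ wt, data = mtcars)
m <- eval_metrics(mtcars$mpg, fitted(fit), 1)
m$adj_r2  # matches summary(fit)$adj.r.squared in-sample
```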

b. Polynomial Linear Regression

Quality.of.Sleep vs. Age

poly_model <- lm(Age ~ poly(Quality.of.Sleep, 2), data = train)
model_summary <- summary(poly_model)
print(model_summary)
## 
## Call:
## lm(formula = Age ~ poly(Quality.of.Sleep, 2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1640  -5.9746  -0.4668   5.6019  13.6019 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 42.1825     0.4397  95.929  < 2e-16 ***
## poly(Quality.of.Sleep, 2)1  67.3822     7.1311   9.449  < 2e-16 ***
## poly(Quality.of.Sleep, 2)2  45.9894     7.1311   6.449 5.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.131 on 260 degrees of freedom
## Multiple R-squared:  0.3348, Adjusted R-squared:  0.3297 
## F-statistic: 65.44 on 2 and 260 DF,  p-value: < 2.2e-16
coefficients <- coef(poly_model)
intercept <- coefficients["(Intercept)"]
coef_linear <- coefficients["poly(Quality.of.Sleep, 2)1"]     # degree-1 term
coef_quadratic <- coefficients["poly(Quality.of.Sleep, 2)2"]  # degree-2 term

# Note: poly() uses an orthogonal basis, so these are not coefficients of raw
# powers of x; the printed equation is schematic rather than literal.
cat("Polynomial Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(coef_linear, 2), "x + ", round(coef_quadratic, 2), "x^2\n")
## Polynomial Linear Regression Model Equation: y =  42.18  +  67.38 x +  45.99 x^2
RSS <- sum(poly_model$residuals^2)
TSS <- sum((train$Age - mean(train$Age))^2)
p <- length(coef(poly_model))
n <- nrow(train)
F_statistic <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
df1 <- p - 1
df2 <- n - p
alpha <- 0.05
p_value <- 1 - pf(F_statistic, df1, df2)

if (p_value < alpha) {
  cat("Reject the null hypothesis. The polynomial regression model is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The polynomial regression model is not statistically significant.\n")
}
## Reject the null hypothesis. The polynomial regression model is statistically significant.
plot(train$Quality.of.Sleep, train$Age, main = "Polynomial Regression (Training Set)", xlab = "Quality of Sleep", ylab = "Age")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(poly_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")
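Because `poly()` defaults to an orthogonal basis, the coefficients printed above do not multiply x and x^2 directly; `raw = TRUE` yields coefficients on the raw powers while producing an identical fitted curve. The sketch below demonstrates this on R's built-in cars data so it runs standalone.

```r
# Same quadratic fit, two parameterizations: orthogonal vs. raw polynomials.
fit_orth <- lm(dist ~ poly(speed, 2), data = cars)
fit_raw  <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)

coef(fit_raw)  # interpretable: intercept, coefficient of x, coefficient of x^2
max(abs(fitted(fit_orth) - fitted(fit_raw)))  # essentially zero: same curve
```

The same `raw = TRUE` option would give a directly readable equation for the Age model above.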

Test against Validation Set

# Caveat: this refits the quadratic model on the validation set itself, so the
# metrics below measure in-sample fit on the validation data rather than how
# the model trained above generalizes.
poly_model <- lm(Age ~ poly(Quality.of.Sleep, 2), data = valid)
predictors <- attr(poly_model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted <- predict(poly_model, newdata = valid)
mse <- mean((valid$Age - predicted)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Age - predicted)^2) / sum((valid$Age - mean(valid$Age))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_poly <- sum(poly_model$residuals^2)
n_poly <- length(poly_model$residuals)
p_poly <- number_of_predictors
Cp_poly <- (SSE_poly / mse) - (n_poly - 2 * p_poly)  # note: mse equals SSE_poly / n_poly here, so this always reduces to 2 * p_poly
AIC_poly <- n_poly * log(SSE_poly / n_poly) + 2 * (p_poly + 1)
BIC_poly <- n_poly * log(SSE_poly / n_poly) + (p_poly + 1) * log(n_poly)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 33.95688
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 5.827253
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.5528441
cat("Number of Predictors in the Polynomial Model:", p_poly, "\n")
## Number of Predictors in the Polynomial Model: 1
cat("Mallow's Cp for the Polynomial Model:", Cp_poly, "\n")
## Mallow's Cp for the Polynomial Model: 2
cat("AIC for the Polynomial Model:", AIC_poly, "\n")
## AIC for the Polynomial Model: 204.9302
cat("BIC for the Polynomial Model:", BIC_poly, "\n")
## BIC for the Polynomial Model: 209.0163
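Two mechanical notes on these criteria. First, because `mse` is computed from the same residuals as `SSE_poly` (so `mse = SSE/n`), the ratio SSE/mse is exactly n and the Cp formula collapses to 2·p regardless of fit quality, which is why every Cp printed in this report equals 2; a proper Mallows' Cp divides by an independent estimate of the error variance from a richer model. Second, with the same SSE, BIC − AIC = (p+1)(ln n − 2), so the gap between them depends only on n and p. A short sketch (Python, illustrative numbers):

```python
import math

def cp_degenerate(sse, n, p):
    """Cp as computed in this report: mse is SSE/n on the *same* data."""
    mse = sse / n
    return sse / mse - (n - 2 * p)  # collapses to 2*p

def aic(sse, n, p):
    return n * math.log(sse / n) + 2 * (p + 1)

def bic(sse, n, p):
    return n * math.log(sse / n) + (p + 1) * math.log(n)

sse, n, p = 1234.5, 57, 1  # illustrative values
print(round(cp_degenerate(sse, n, p), 6))             # 2.0 -- always 2*p, whatever sse is
print(round(bic(sse, n, p) - aic(sse, n, p), 4))      # equals (p + 1) * (ln n - 2)
```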
plot(valid$Quality.of.Sleep, valid$Age, main = "Polynomial Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Age")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted, col = "red", pch = 20)

Test against Test Set

poly_model_test <- lm(test$Age ~ poly(test$Quality.of.Sleep, 2), data = test)
predictors <- attr(poly_model_test$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_test <- predict(poly_model_test, newdata = test)
mse_test <- mean((test$Age - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- number_of_predictors
r_squared_test <- 1 - (sum((test$Age - predicted_test)^2) / sum((test$Age - mean(test$Age))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)
SSE_poly_test <- sum(poly_model_test$residuals^2)
n_poly_test <- length(poly_model_test$residuals)
Cp_poly_test <- (SSE_poly_test / mse_test) - (n_poly_test - 2 * k_test)
AIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + 2 * (k_test + 1)
BIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + (k_test + 1) * log(n_poly_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 53.28474
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 7.29964
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.2181177
cat("Number of Predictors in the Polynomial Model on the test set:", k_test, "\n")
## Number of Predictors in the Polynomial Model on the test set: 1
cat("Mallow's Cp for the Polynomial Model on the test set:", Cp_poly_test, "\n")
## Mallow's Cp for the Polynomial Model on the test set: 2
cat("AIC for the Polynomial Model on the test set:", AIC_poly_test, "\n")
## AIC for the Polynomial Model on the test set: 218.6851
cat("BIC for the Polynomial Model on the test set:", BIC_poly_test, "\n")
## BIC for the Polynomial Model on the test set: 222.6631
plot(test$Quality.of.Sleep, test$Age, main = "Polynomial Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Age")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

The fitted line is jagged rather than a smooth quadratic curve. This is largely a plotting artifact: because the formula references `train$Quality.of.Sleep` directly, `predict()` cannot match the `newdata` grid and instead returns the training fitted values in row order. Fitting the model as `lm(Age ~ poly(Quality.of.Sleep, 2), data = train)` and predicting over a sorted grid would produce a smooth curve. As for the fit itself, the evaluation metrics suggest moderate performance on both the validation and test sets: the adjusted R-squared values, while not high, indicate the model explains a meaningful share of the variance in Age, and the MSE and RMSE values are moderate, suggesting acceptable accuracy.
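One caveat on the evaluation code above: the validation and test models are refit on those splits (e.g. `lm(valid$Age ~ poly(valid$Quality.of.Sleep, 2), ...)`), so the reported metrics describe in-sample fit for each split rather than true out-of-sample error. The intended protocol fits once on the training set and evaluates that same model on held-out data. A minimal sketch of that protocol (Python with NumPy, illustrative data, not the sleep dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative data standing in for Quality.of.Sleep and Age
x = rng.uniform(4, 9, 300)
y = 30 + 2.5 * x + rng.normal(0, 3, 300)

# train / validation split: the model is fit on the training rows only
x_tr, y_tr = x[:210], y[:210]
x_va, y_va = x[210:255], y[210:255]

coefs = np.polyfit(x_tr, y_tr, deg=2)   # quadratic fit on training data only
pred_va = np.polyval(coefs, x_va)       # evaluate the *same* model on validation data
mse_va = np.mean((y_va - pred_va) ** 2)
print(round(mse_va, 2))
```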

Quality.of.Sleep vs. Sleep.Duration

poly_model <- lm(train$Sleep.Duration ~ poly(train$Quality.of.Sleep, 2), data = train)
model_summary <- summary(poly_model)
print(model_summary)
## 
## Call:
## lm(formula = train$Sleep.Duration ~ poly(train$Quality.of.Sleep, 
##     2), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3677 -0.2677 -0.1250  0.2037  1.0539 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.12510    0.02261 315.132  < 2e-16 ***
## poly(train$Quality.of.Sleep, 2)1 11.23108    0.36667  30.630  < 2e-16 ***
## poly(train$Quality.of.Sleep, 2)2  1.59400    0.36667   4.347 1.98e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3667 on 260 degrees of freedom
## Multiple R-squared:  0.7864, Adjusted R-squared:  0.7847 
## F-statistic: 478.5 on 2 and 260 DF,  p-value: < 2.2e-16
coefficients <- coef(poly_model)
# poly(..., 2)1 is the linear term and poly(..., 2)2 the quadratic term
coef_linear <- coefficients["poly(train$Quality.of.Sleep, 2)1"]
coef_quadratic <- coefficients["poly(train$Quality.of.Sleep, 2)2"]
intercept <- coefficients["(Intercept)"]

cat("Polynomial Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(coef_linear, 2), "x + ", round(coef_quadratic, 2), "x^2\n")
## Polynomial Linear Regression Model Equation: y =  7.13  +  11.23 x +  1.59 x^2
RSS <- sum(poly_model$residuals^2)
TSS <- sum((train$Sleep.Duration - mean(train$Sleep.Duration))^2)
p <- length(coef(poly_model))
n <- nrow(train)
F_statistic <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
df1 <- p - 1
df2 <- n - p
alpha <- 0.05
p_value <- 1 - pf(F_statistic, df1, df2)

if (p_value < alpha) {
  cat("Reject the null hypothesis. The polynomial regression model is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The polynomial regression model is not statistically significant.\n")
}
## Reject the null hypothesis. The polynomial regression model is statistically significant.
plot(train$Quality.of.Sleep, train$Sleep.Duration, main = "Polynomial Regression", xlab = "Quality.of.Sleep", ylab = "Sleep.Duration")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(poly_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

poly_model <- lm(valid$Sleep.Duration ~ poly(valid$Quality.of.Sleep, 2), data = valid)
predictors <- attr(poly_model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted <- predict(poly_model, newdata = valid)
mse <- mean((valid$Sleep.Duration - predicted)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Sleep.Duration - predicted)^2) / sum((valid$Sleep.Duration - mean(valid$Sleep.Duration))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_poly <- sum(poly_model$residuals^2)
n_poly <- length(poly_model$residuals)
p_poly <- number_of_predictors
Cp_poly <- (SSE_poly / mse) - (n_poly - 2 * p_poly)
AIC_poly <- n_poly * log(SSE_poly / n_poly) + 2 * (p_poly + 1)
BIC_poly <- n_poly * log(SSE_poly / n_poly) + (p_poly + 1) * log(n_poly)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 0.1200596
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.3464961
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.8121846
cat("Number of Predictors in the Polynomial Model:", p_poly, "\n")
## Number of Predictors in the Polynomial Model: 1
cat("Mallow's Cp for the Polynomial Model:", Cp_poly, "\n")
## Mallow's Cp for the Polynomial Model: 2
cat("AIC for the Polynomial Model:", AIC_poly, "\n")
## AIC for the Polynomial Model: -116.8267
cat("BIC for the Polynomial Model:", BIC_poly, "\n")
## BIC for the Polynomial Model: -112.7406
plot(valid$Quality.of.Sleep, valid$Sleep.Duration, main = "Polynomial Regression (Validation Set)", xlab = "Quality.of.Sleep", ylab = "Sleep.Duration")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted, col = "red", pch = 20)

Test against Test Set

poly_model_test <- lm(test$Sleep.Duration ~ poly(test$Quality.of.Sleep, 2), data = test)
predictors <- attr(poly_model_test$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_test <- predict(poly_model_test, newdata = test)
mse_test <- mean((test$Sleep.Duration - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- number_of_predictors
r_squared_test <- 1 - (sum((test$Sleep.Duration - predicted_test)^2) / sum((test$Sleep.Duration - mean(test$Sleep.Duration))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)
SSE_poly_test <- sum(poly_model_test$residuals^2)
n_poly_test <- length(poly_model_test$residuals)
Cp_poly_test <- (SSE_poly_test / mse_test) - (n_poly_test - 2 * k_test)
AIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + 2 * (k_test + 1)
BIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + (k_test + 1) * log(n_poly_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 0.1121022
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 0.3348167
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.8252327
cat("Number of Predictors in the Polynomial Model on the test set:", k_test, "\n")
## Number of Predictors in the Polynomial Model on the test set: 1
cat("Mallow's Cp for the Polynomial Model on the test set:", Cp_poly_test, "\n")
## Mallow's Cp for the Polynomial Model on the test set: 2
cat("AIC for the Polynomial Model on the test set:", AIC_poly_test, "\n")
## AIC for the Polynomial Model on the test set: -114.1706
cat("BIC for the Polynomial Model on the test set:", BIC_poly_test, "\n")
## BIC for the Polynomial Model on the test set: -110.1926
plot(test$Quality.of.Sleep, test$Sleep.Duration, main = "Polynomial Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Sleep.Duration")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

As with the previous model, the fitted line is jagged rather than smooth, again because `predict()` returns the training fitted values in row order when the formula references `train$` columns directly. The evaluation metrics, however, indicate a good fit: the adjusted R-squared values are high, showing that the model explains most of the variance in Sleep.Duration, and the low MSE and RMSE values on the validation and test sets indicate strong prediction accuracy.
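A subtlety in the predictor count used in the evaluation code: `attr(poly_model$terms, "term.labels")` treats `poly(Quality.of.Sleep, 2)` as a single term, so k = 1 even though the quadratic model estimates two slope coefficients; the adjusted R-squared, Cp, AIC, and BIC penalties are therefore slightly too lenient. The adjusted R-squared formula itself behaves as sketched below (Python, illustrative numbers):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: shrinks R^2 by a penalty that grows with the predictor count k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2, n = 0.82, 57  # illustrative values
# counting both polynomial terms (k = 2) gives a slightly smaller value than k = 1
print(adjusted_r2(r2, n, 1) > adjusted_r2(r2, n, 2))  # True
print(adjusted_r2(r2, n, 1) < r2)                     # True: adjustment never inflates R^2
```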

Quality.of.Sleep vs. Daily.Steps

poly_model <- lm(train$Daily.Steps ~ poly(train$Quality.of.Sleep, 2), data = train)
model_summary <- summary(poly_model)
print(model_summary)
## 
## Call:
## lm(formula = train$Daily.Steps ~ poly(train$Quality.of.Sleep, 
##     2), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4003.4 -1243.4  -152.7   847.3  3756.6 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       6805.32      96.66  70.404  < 2e-16 ***
## poly(train$Quality.of.Sleep, 2)1  1085.30    1567.58   0.692    0.489    
## poly(train$Quality.of.Sleep, 2)2 -8909.37    1567.58  -5.684 3.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1568 on 260 degrees of freedom
## Multiple R-squared:  0.112,  Adjusted R-squared:  0.1051 
## F-statistic: 16.39 on 2 and 260 DF,  p-value: 1.976e-07
coefficients <- coef(poly_model)
# poly(..., 2)1 is the linear term and poly(..., 2)2 the quadratic term
coef_linear <- coefficients["poly(train$Quality.of.Sleep, 2)1"]
coef_quadratic <- coefficients["poly(train$Quality.of.Sleep, 2)2"]
intercept <- coefficients["(Intercept)"]

cat("Polynomial Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(coef_linear, 2), "x + ", round(coef_quadratic, 2), "x^2\n")
## Polynomial Linear Regression Model Equation: y =  6805.32  +  1085.3 x +  -8909.37 x^2
RSS <- sum(poly_model$residuals^2)
TSS <- sum((train$Daily.Steps - mean(train$Daily.Steps))^2)
p <- length(coef(poly_model))
n <- nrow(train)
F_statistic <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
df1 <- p - 1
df2 <- n - p
alpha <- 0.05
p_value <- 1 - pf(F_statistic, df1, df2)

if (p_value < alpha) {
  cat("Reject the null hypothesis. The polynomial regression model is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The polynomial regression model is not statistically significant.\n")
}
## Reject the null hypothesis. The polynomial regression model is statistically significant.
plot(train$Quality.of.Sleep, train$Daily.Steps, main = "Polynomial Regression", xlab = "Quality.of.Sleep", ylab = "Daily.Steps")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(poly_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

poly_model <- lm(valid$Daily.Steps ~ poly(valid$Quality.of.Sleep, 2), data = valid)
predictors <- attr(poly_model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted <- predict(poly_model, newdata = valid)
mse <- mean((valid$Daily.Steps - predicted)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Daily.Steps - predicted)^2) / sum((valid$Daily.Steps - mean(valid$Daily.Steps))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_poly <- sum(poly_model$residuals^2)
n_poly <- length(poly_model$residuals)
p_poly <- number_of_predictors
Cp_poly <- (SSE_poly / mse) - (n_poly - 2 * p_poly)
AIC_poly <- n_poly * log(SSE_poly / n_poly) + 2 * (p_poly + 1)
BIC_poly <- n_poly * log(SSE_poly / n_poly) + (p_poly + 1) * log(n_poly)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 1802816
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 1342.69
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.06130834
cat("Number of Predictors in the Polynomial Model:", p_poly, "\n")
## Number of Predictors in the Polynomial Model: 1
cat("Mallow's Cp for the Polynomial Model:", Cp_poly, "\n")
## Mallow's Cp for the Polynomial Model: 2
cat("AIC for the Polynomial Model:", AIC_poly, "\n")
## AIC for the Polynomial Model: 825.077
cat("BIC for the Polynomial Model:", BIC_poly, "\n")
## BIC for the Polynomial Model: 829.1631
plot(valid$Quality.of.Sleep, valid$Daily.Steps, main = "Polynomial Regression (Validation Set)", xlab = "Quality.of.Sleep", ylab = "Daily.Steps")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted, col = "red", pch = 20)

Test against Test Set

poly_model_test <- lm(test$Daily.Steps ~ poly(test$Quality.of.Sleep, 2), data = test)
predictors <- attr(poly_model_test$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_test <- predict(poly_model_test, newdata = test)
mse_test <- mean((test$Daily.Steps - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- number_of_predictors
r_squared_test <- 1 - (sum((test$Daily.Steps - predicted_test)^2) / sum((test$Daily.Steps - mean(test$Daily.Steps))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)
SSE_poly_test <- sum(poly_model_test$residuals^2)
n_poly_test <- length(poly_model_test$residuals)
Cp_poly_test <- (SSE_poly_test / mse_test) - (n_poly_test - 2 * k_test)
AIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + 2 * (k_test + 1)
BIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + (k_test + 1) * log(n_poly_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 2028710
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 1424.328
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.2284612
cat("Number of Predictors in the Polynomial Model on the test set:", k_test, "\n")
## Number of Predictors in the Polynomial Model on the test set: 1
cat("Mallow's Cp for the Polynomial Model on the test set:", Cp_poly_test, "\n")
## Mallow's Cp for the Polynomial Model on the test set: 2
cat("AIC for the Polynomial Model on the test set:", AIC_poly_test, "\n")
## AIC for the Polynomial Model on the test set: 788.2372
cat("BIC for the Polynomial Model on the test set:", BIC_poly_test, "\n")
## BIC for the Polynomial Model on the test set: 792.2152
plot(test$Quality.of.Sleep, test$Daily.Steps, main = "Polynomial Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Daily.Steps")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

As before, the fitted line is jagged because the predictions are plotted in training-row order rather than along a sorted grid. The evaluation metrics indicate a weak fit: the adjusted R-squared values are low and the MSE and RMSE values are high relative to the scale of Daily.Steps. Although the overall F-test is statistically significant, the performance on the validation and test sets suggests that a quadratic in Quality.of.Sleep is a poor predictor of Daily.Steps.
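One way to judge whether these errors are "high" is to compare the RMSE with the scale of the response: an RMSE near 1,343 steps against a typical value of roughly 6,805 daily steps (the training-set intercept above, which equals the training mean under the orthogonal-polynomial parameterization) is about a 20% relative error. A quick check (Python):

```python
# values reported above: validation RMSE and the training-set mean response
rmse, mean_response = 1342.69, 6805.32
relative_rmse = rmse / mean_response
print(round(relative_rmse, 3))  # 0.197 -- roughly a 20% relative error
```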

Quality.of.Sleep vs. Physical.Activity.Level

poly_model <- lm(train$Physical.Activity.Level ~ poly(train$Quality.of.Sleep, 2), data = train)
model_summary <- summary(poly_model)
print(model_summary)
## 
## Call:
## lm(formula = train$Physical.Activity.Level ~ poly(train$Quality.of.Sleep, 
##     2), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.549 -17.549  -3.862  16.394  35.332 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        59.072      1.216  48.593  < 2e-16 ***
## poly(train$Quality.of.Sleep, 2)1   59.693     19.714   3.028 0.002710 ** 
## poly(train$Quality.of.Sleep, 2)2  -77.146     19.714  -3.913 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.71 on 260 degrees of freedom
## Multiple R-squared:  0.08605,    Adjusted R-squared:  0.07902 
## F-statistic: 12.24 on 2 and 260 DF,  p-value: 8.31e-06
coefficients <- coef(poly_model)
# poly(..., 2)1 is the linear term and poly(..., 2)2 the quadratic term
coef_linear <- coefficients["poly(train$Quality.of.Sleep, 2)1"]
coef_quadratic <- coefficients["poly(train$Quality.of.Sleep, 2)2"]
intercept <- coefficients["(Intercept)"]

cat("Polynomial Linear Regression Model Equation: y = ", round(intercept, 2), " + ", round(coef_linear, 2), "x + ", round(coef_quadratic, 2), "x^2\n")
## Polynomial Linear Regression Model Equation: y =  59.07  +  59.69 x +  -77.15 x^2
RSS <- sum(poly_model$residuals^2)
TSS <- sum((train$Physical.Activity.Level - mean(train$Physical.Activity.Level))^2)
p <- length(coef(poly_model))
n <- nrow(train)
F_statistic <- ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
df1 <- p - 1
df2 <- n - p
alpha <- 0.05
p_value <- 1 - pf(F_statistic, df1, df2)

if (p_value < alpha) {
  cat("Reject the null hypothesis. The polynomial regression model is statistically significant.\n")
} else {
  cat("Fail to reject the null hypothesis. The polynomial regression model is not statistically significant.\n")
}
## Reject the null hypothesis. The polynomial regression model is statistically significant.
plot(train$Quality.of.Sleep, train$Physical.Activity.Level, main = "Polynomial Regression", xlab = "Quality.of.Sleep", ylab = "Physical.Activity.Level")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(poly_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

poly_model <- lm(valid$Physical.Activity.Level ~ poly(valid$Quality.of.Sleep, 2), data = valid)
predictors <- attr(poly_model$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted <- predict(poly_model, newdata = valid)
mse <- mean((valid$Physical.Activity.Level - predicted)^2)
rmse <- sqrt(mse)
n <- nrow(valid)
k <- number_of_predictors
r_squared <- 1 - (sum((valid$Physical.Activity.Level - predicted)^2) / sum((valid$Physical.Activity.Level - mean(valid$Physical.Activity.Level))^2))
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

SSE_poly <- sum(poly_model$residuals^2)
n_poly <- length(poly_model$residuals)
p_poly <- number_of_predictors
Cp_poly <- (SSE_poly / mse) - (n_poly - 2 * p_poly)
AIC_poly <- n_poly * log(SSE_poly / n_poly) + 2 * (p_poly + 1)
BIC_poly <- n_poly * log(SSE_poly / n_poly) + (p_poly + 1) * log(n_poly)

cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 384.6385
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 19.6122
cat("Adjusted R-squared (Adjusted R^2):", adjusted_r_squared, "\n")
## Adjusted R-squared (Adjusted R^2): 0.1392442
cat("Number of Predictors in the Polynomial Model:", p_poly, "\n")
## Number of Predictors in the Polynomial Model: 1
cat("Mallow's Cp for the Polynomial Model:", Cp_poly, "\n")
## Mallow's Cp for the Polynomial Model: 2
cat("AIC for the Polynomial Model:", AIC_poly, "\n")
## AIC for the Polynomial Model: 343.2813
cat("BIC for the Polynomial Model:", BIC_poly, "\n")
## BIC for the Polynomial Model: 347.3674
plot(valid$Quality.of.Sleep, valid$Physical.Activity.Level, main = "Polynomial Regression (Validation Set)", xlab = "Quality.of.Sleep", ylab = "Physical.Activity.Level")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted, col = "red", pch = 20)

Test against Test Set

poly_model_test <- lm(test$Physical.Activity.Level ~ poly(test$Quality.of.Sleep, 2), data = test)
predictors <- attr(poly_model_test$terms, "term.labels")
number_of_predictors <- length(predictors)

predicted_test <- predict(poly_model_test, newdata = test)
mse_test <- mean((test$Physical.Activity.Level - predicted_test)^2)
rmse_test <- sqrt(mse_test)
n_test <- nrow(test)
k_test <- number_of_predictors
r_squared_test <- 1 - (sum((test$Physical.Activity.Level - predicted_test)^2) / sum((test$Physical.Activity.Level - mean(test$Physical.Activity.Level))^2))
adjusted_r_squared_test <- 1 - (1 - r_squared_test) * (n_test - 1) / (n_test - k_test - 1)
SSE_poly_test <- sum(poly_model_test$residuals^2)
n_poly_test <- length(poly_model_test$residuals)
Cp_poly_test <- (SSE_poly_test / mse_test) - (n_poly_test - 2 * k_test)
AIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + 2 * (k_test + 1)
BIC_poly_test <- n_poly_test * log(SSE_poly_test / n_poly_test) + (k_test + 1) * log(n_poly_test)

cat("Mean Squared Error (MSE) on the test set:", mse_test, "\n")
## Mean Squared Error (MSE) on the test set: 400.3386
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 20.00846
cat("Adjusted R-squared (Adjusted R^2) on the test set:", adjusted_r_squared_test, "\n")
## Adjusted R-squared (Adjusted R^2) on the test set: 0.130602
cat("Number of Predictors in the Polynomial Model on the test set:", k_test, "\n")
## Number of Predictors in the Polynomial Model on the test set: 1
cat("Mallow's Cp for the Polynomial Model on the test set:", Cp_poly_test, "\n")
## Mallow's Cp for the Polynomial Model on the test set: 2
cat("AIC for the Polynomial Model on the test set:", AIC_poly_test, "\n")
## AIC for the Polynomial Model on the test set: 327.5848
cat("BIC for the Polynomial Model on the test set:", BIC_poly_test, "\n")
## BIC for the Polynomial Model on the test set: 331.5627
plot(test$Quality.of.Sleep, test$Physical.Activity.Level, main = "Polynomial Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Physical.Activity.Level")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_test, col = "red", pch = 20)

As in the earlier models, the plotted line is jagged because the predictions are returned in training-row order rather than along the sorted grid. The evaluation metrics show limited explanatory power and prediction accuracy: the low adjusted R-squared values on the validation and test sets, together with relatively high MSE and RMSE values, suggest the model is not a good fit for these data.

2. Nonparametric Regression Models

a. Kernel Regression

Quality.of.Sleep vs. Age

library(locfit)
## locfit 1.5-9.8    2023-06-11
kernel_model <- locfit(Age ~ Quality.of.Sleep, data = train)
predicted_kernel_train <- predict(kernel_model, newdata = train)

kernel_model_summary <- summary(kernel_model)
print(kernel_model_summary)
## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = Age ~ Quality.of.Sleep, data = train)
## 
## Number of data points:  263 
## Independent variables:  Quality.of.Sleep 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  7 
## Degree of fit:  2 
## Fitted Degrees of Freedom:  4.661
mse_kernel_train <- mean((train$Age - predicted_kernel_train)^2)
rmse_kernel_train <- sqrt(mse_kernel_train)
n_train <- nrow(train)
k_kernel_train <- 1  
r_squared_kernel_train <- 1 - (sum((train$Age - predicted_kernel_train)^2) / sum((train$Age - mean(train$Age))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_kernel_train, "\n")
## Mean Squared Error (MSE) on the training set: 34.26835
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_kernel_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 5.853918
cat("R-squared on the training set:", r_squared_kernel_train, "\n")
## R-squared on the training set: 0.5465881
plot(train$Quality.of.Sleep, train$Age, main = "Kernel Regression (Train Set)", xlab = "Quality of Sleep", ylab = "Age")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(kernel_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

predicted_kernel <- predict(kernel_model, newdata = valid)

mse_kernel <- mean((valid$Age - predicted_kernel)^2)
rmse_kernel <- sqrt(mse_kernel)

n_kernel <- nrow(valid)
k_kernel <- 1 
r_squared_kernel <- 1 - (sum((valid$Age - predicted_kernel)^2) / sum((valid$Age - mean(valid$Age))^2))

cat("Mean Squared Error (MSE):", mse_kernel, "\n")
## Mean Squared Error (MSE): 34.03129
cat("Root Mean Squared Error (RMSE):", rmse_kernel, "\n")
## Root Mean Squared Error (RMSE): 5.833634
cat("R-squared (R^2):", r_squared_kernel, "\n")
## R-squared (R^2): 0.5598667
plot(valid$Quality.of.Sleep, valid$Age, main = "Kernel Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Age")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted_kernel, col = "red", pch = 20)

Test against Test Set

predicted_kernel_test <- predict(kernel_model, newdata = test)

mse_kernel_test <- mean((test$Age - predicted_kernel_test)^2)
rmse_kernel_test <- sqrt(mse_kernel_test)
n_kernel_test <- nrow(test)
r_squared_kernel_test <- 1 - (sum((test$Age - predicted_kernel_test)^2) / sum((test$Age - mean(test$Age))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_kernel_test, "\n")
## Mean Squared Error (MSE) on the test set: 41.95146
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_kernel_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 6.476995
cat("R-squared (R^2) on the test set:", r_squared_kernel_test, "\n")
## R-squared (R^2) on the test set: 0.3960331
plot(test$Quality.of.Sleep, test$Age, main = "Kernel Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Age")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_kernel_test, col = "red", pch = 20)

The kernel regression model for quality of sleep and age provides a reasonable fit, with moderate R-squared values on the training and validation sets (about 0.55 and 0.56). Performance degrades on the test set, where the MSE and RMSE are higher than on the training and validation sets and the R-squared of about 0.40 indicates the model explains less than half of the variance in Age for unseen data. The fit is acceptable, but generalization to new data is noticeably weaker.
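`locfit` fits local polynomial regression (here degree 2 with a data-driven bandwidth). The simplest member of this family is the Nadaraya-Watson kernel smoother, which predicts a kernel-weighted average of nearby responses. A minimal sketch (Python with NumPy, Gaussian kernel; illustrative, not the locfit algorithm itself):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.5):
    """Kernel-weighted average of y_train around each query point."""
    x_query = np.atleast_1d(x_query)
    # Gaussian kernel weights: one row per query point, one column per training point
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

# toy check: at the center of a symmetric design, smoothing a linear
# relation y = 30 + 2x recovers the true value 30 + 2 * 6.5 = 43
x = np.linspace(4, 9, 100)
y = 30 + 2 * x
print(np.round(nadaraya_watson(x, y, 6.5)[0], 2))  # 43.0 (weights symmetric about the query)
```

A smaller bandwidth tracks the data more closely (lower bias, higher variance); `locfit` chooses this trade-off automatically via its nearest-neighbor fraction.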

Quality.of.Sleep vs. Sleep.Duration

kernel_model <- locfit(Sleep.Duration ~ Quality.of.Sleep, data = train)
predicted_kernel_train <- predict(kernel_model, newdata = train)

kernel_model_summary <- summary(kernel_model)
print(kernel_model_summary)
## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = Sleep.Duration ~ Quality.of.Sleep, data = train)
## 
## Number of data points:  263 
## Independent variables:  Quality.of.Sleep 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  7 
## Degree of fit:  2 
## Fitted Degrees of Freedom:  4.661
mse_kernel_train <- mean((train$Sleep.Duration - predicted_kernel_train)^2)
rmse_kernel_train <- sqrt(mse_kernel_train)
n_train <- nrow(train)
k_kernel_train <- 1  
r_squared_kernel_train <- 1 - (sum((train$Sleep.Duration - predicted_kernel_train)^2) / sum((train$Sleep.Duration - mean(train$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_kernel_train, "\n")
## Mean Squared Error (MSE) on the training set: 0.1087478
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_kernel_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 0.3297694
cat("R-squared on the training set:", r_squared_kernel_train, "\n")
## R-squared on the training set: 0.8252159
plot(train$Quality.of.Sleep, train$Sleep.Duration, main = "Kernel Regression (Train Set)", xlab = "Quality of Sleep", ylab = "Sleep.Duration")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(kernel_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

predicted_kernel <- predict(kernel_model, newdata = valid)

mse_kernel <- mean((valid$Sleep.Duration - predicted_kernel)^2)
rmse_kernel <- sqrt(mse_kernel)

n_kernel <- nrow(valid)
k_kernel <- 1 
r_squared_kernel <- 1 - (sum((valid$Sleep.Duration - predicted_kernel)^2) / sum((valid$Sleep.Duration - mean(valid$Sleep.Duration))^2))

cat("Mean Squared Error (MSE):", mse_kernel, "\n")
## Mean Squared Error (MSE): 0.1085911
cat("Root Mean Squared Error (RMSE):", rmse_kernel, "\n")
## Root Mean Squared Error (RMSE): 0.3295316
cat("R-squared (R^2):", r_squared_kernel, "\n")
## R-squared (R^2): 0.8331588
plot(valid$Quality.of.Sleep, valid$Sleep.Duration, main = "Kernel Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Sleep.Duration")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted_kernel, col = "red", pch = 20)

Test against Test Set

predicted_kernel_test <- predict(kernel_model, newdata = test)

mse_kernel_test <- mean((test$Sleep.Duration - predicted_kernel_test)^2)
rmse_kernel_test <- sqrt(mse_kernel_test)
n_kernel_test <- nrow(test)
r_squared_kernel_test <- 1 - (sum((test$Sleep.Duration - predicted_kernel_test)^2) / sum((test$Sleep.Duration - mean(test$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_kernel_test, "\n")
## Mean Squared Error (MSE) on the test set: 0.1049263
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_kernel_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 0.3239234
cat("R-squared (R^2) on the test set:", r_squared_kernel_test, "\n")
## R-squared (R^2) on the test set: 0.8395063
plot(test$Quality.of.Sleep, test$Sleep.Duration, main = "Kernel Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Sleep.Duration")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_kernel_test, col = "red", pch = 20)

From the above evaluations, this kernel regression model appears to fit the data very well. It shows strong predictive performance, with low MSE and RMSE values. Additionally, the high R-squared values suggest that the model captures a large portion of the relationship between the quality of sleep and the sleep duration. The R-squared value on the test set indicates that the model explains roughly 84% of the variance in sleep duration for new, unseen data. This suggests that the model is reliable for predicting sleep duration from this dataset, and that there is a strong association between the two variables.
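As a note on the mechanics, locfit fits a local polynomial around each evaluation point. A minimal sketch of the closely related Nadaraya-Watson estimator (a local constant fit with a Gaussian kernel; the bandwidth h and the toy data are illustrative choices, not values from this report) looks like this:

```r
# Minimal Nadaraya-Watson kernel estimator: a locally weighted average
# of the responses, with weights from a Gaussian kernel centered at x0.
# The bandwidth h = 0.5 is an illustrative choice, not a tuned value.
nw_predict <- function(x0, x, y, h = 0.5) {
  w <- dnorm((x - x0) / h)   # kernel weights centered at x0
  sum(w * y) / sum(w)        # weighted average of observed responses
}

# Toy data roughly mimicking sleep quality (x) vs. sleep duration (y)
x <- c(4, 5, 6, 7, 8, 9)
y <- c(5.9, 6.1, 6.4, 7.1, 7.4, 8.2)
yhat <- sapply(x, nw_predict, x = x, y = y)
```

Because each prediction is a weighted average of observed responses, the fitted values always stay within the range of the data, unlike the local quadratic fits that locfit uses by default.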

Quality.of.Sleep v.s. Daily.Steps

kernel_model <- locfit(Daily.Steps ~ Quality.of.Sleep, data = train)
predicted_kernel_train <- predict(kernel_model, newdata = train)

kernel_model_summary <- summary(kernel_model)
print(kernel_model_summary)
## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = Daily.Steps ~ Quality.of.Sleep, data = train)
## 
## Number of data points:  263 
## Independent variables:  Quality.of.Sleep 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  7 
## Degree of fit:  2 
## Fitted Degrees of Freedom:  4.661
mse_kernel_train <- mean((train$Daily.Steps - predicted_kernel_train)^2)
rmse_kernel_train <- sqrt(mse_kernel_train)
n_train <- nrow(train)
k_kernel_train <- 1  
r_squared_kernel_train <- 1 - (sum((train$Daily.Steps - predicted_kernel_train)^2) / sum((train$Daily.Steps - mean(train$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_kernel_train, "\n")
## Mean Squared Error (MSE) on the training set: 2245113
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_kernel_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 1498.37
cat("R-squared on the training set:", r_squared_kernel_train, "\n")
## R-squared on the training set: 0.1792859
plot(train$Quality.of.Sleep, train$Daily.Steps, main = "Kernel Regression (Train Set)", xlab = "Quality of Sleep", ylab = "Daily.Steps")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(kernel_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

predicted_kernel <- predict(kernel_model, newdata = valid)

mse_kernel <- mean((valid$Daily.Steps - predicted_kernel)^2)
rmse_kernel <- sqrt(mse_kernel)

n_kernel <- nrow(valid)
k_kernel <- 1 
r_squared_kernel <- 1 - (sum((valid$Daily.Steps - predicted_kernel)^2) / sum((valid$Daily.Steps - mean(valid$Daily.Steps))^2))

cat("Mean Squared Error (MSE):", mse_kernel, "\n")
## Mean Squared Error (MSE): 1758961
cat("Root Mean Squared Error (RMSE):", rmse_kernel, "\n")
## Root Mean Squared Error (RMSE): 1326.258
cat("R-squared (R^2):", r_squared_kernel, "\n")
## R-squared (R^2): 0.1004973
plot(valid$Quality.of.Sleep, valid$Daily.Steps, main = "Kernel Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Daily.Steps")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted_kernel, col = "red", pch = 20)

Test against Test Set

predicted_kernel_test <- predict(kernel_model, newdata = test)

mse_kernel_test <- mean((test$Daily.Steps - predicted_kernel_test)^2)
rmse_kernel_test <- sqrt(mse_kernel_test)
n_kernel_test <- nrow(test)
r_squared_kernel_test <- 1 - (sum((test$Daily.Steps - predicted_kernel_test)^2) / sum((test$Daily.Steps - mean(test$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_kernel_test, "\n")
## Mean Squared Error (MSE) on the test set: 2391474
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_kernel_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 1546.439
cat("R-squared (R^2) on the test set:", r_squared_kernel_test, "\n")
## R-squared (R^2) on the test set: 0.1076591
plot(test$Quality.of.Sleep, test$Daily.Steps, main = "Kernel Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Daily.Steps")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_kernel_test, col = "red", pch = 20)

For the quality of sleep and the daily steps, this kernel regression model does not provide a good fit. The high MSE and RMSE values indicate that the model's predictions have large errors, and the low R-squared values show that it captures only a small portion of the relationship between sleep quality and the number of daily steps. The R-squared value on the test set suggests that the model explains only about 11% of the variance in daily steps for new, unseen data. This is a weak fit, indicating that the model's predictive power, and the association between the variables, is limited.
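The low R-squared here essentially mirrors a weak raw correlation: for a single predictor, the R-squared of a simple linear fit equals the squared Pearson correlation, which gives a quick way to sanity-check such values. A sketch on synthetic weakly related data (not the sleep dataset):

```r
# For one predictor, R-squared of lm(y ~ x) equals cor(x, y)^2.
# Synthetic, weakly related data for illustration only.
set.seed(42)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

r2_from_cor <- cor(x, y)^2
r2_from_lm  <- summary(lm(y ~ x))$r.squared
all.equal(r2_from_cor, r2_from_lm)  # TRUE
```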

Quality.of.Sleep v.s. Physical.Activity.Level

kernel_model <- locfit(Physical.Activity.Level ~ Quality.of.Sleep, data = train)
predicted_kernel_train <- predict(kernel_model, newdata = train)

kernel_model_summary <- summary(kernel_model)
print(kernel_model_summary)
## Estimation type: Local Regression 
## 
## Call:
## locfit(formula = Physical.Activity.Level ~ Quality.of.Sleep, 
##     data = train)
## 
## Number of data points:  263 
## Independent variables:  Quality.of.Sleep 
## Evaluation structure: Rectangular Tree 
## Number of evaluation points:  7 
## Degree of fit:  2 
## Fitted Degrees of Freedom:  4.661
mse_kernel_train <- mean((train$Physical.Activity.Level - predicted_kernel_train)^2)
rmse_kernel_train <- sqrt(mse_kernel_train)
n_train <- nrow(train)
k_kernel_train <- 1  
r_squared_kernel_train <- 1 - (sum((train$Physical.Activity.Level - predicted_kernel_train)^2) / sum((train$Physical.Activity.Level - mean(train$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_kernel_train, "\n")
## Mean Squared Error (MSE) on the training set: 372.3782
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_kernel_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 19.2971
cat("R-squared on the training set:", r_squared_kernel_train, "\n")
## R-squared on the training set: 0.1142322
plot(train$Quality.of.Sleep, train$Physical.Activity.Level, main = "Kernel Regression (Train Set)", xlab = "Quality of Sleep", ylab = "Physical.Activity.Level")
xseq <- seq(min(train$Quality.of.Sleep), max(train$Quality.of.Sleep), length.out = length(train$Quality.of.Sleep))
yhat <- predict(kernel_model, newdata = data.frame(Quality.of.Sleep = xseq))
lines(xseq, yhat, col = "red")

Test against Validation Set

predicted_kernel <- predict(kernel_model, newdata = valid)

mse_kernel <- mean((valid$Physical.Activity.Level - predicted_kernel)^2)
rmse_kernel <- sqrt(mse_kernel)

n_kernel <- nrow(valid)
k_kernel <- 1 
r_squared_kernel <- 1 - (sum((valid$Physical.Activity.Level - predicted_kernel)^2) / sum((valid$Physical.Activity.Level - mean(valid$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE):", mse_kernel, "\n")
## Mean Squared Error (MSE): 388.5592
cat("Root Mean Squared Error (RMSE):", rmse_kernel, "\n")
## Root Mean Squared Error (RMSE): 19.7119
cat("R-squared (R^2):", r_squared_kernel, "\n")
## R-squared (R^2): 0.1459978
plot(valid$Quality.of.Sleep, valid$Physical.Activity.Level, main = "Kernel Regression (Validation Set)", xlab = "Quality of Sleep", ylab = "Physical.Activity.Level")
lines(xseq, yhat, col = "blue")
points(valid$Quality.of.Sleep, predicted_kernel, col = "red", pch = 20)

Test against Test Set

predicted_kernel_test <- predict(kernel_model, newdata = test)

mse_kernel_test <- mean((test$Physical.Activity.Level - predicted_kernel_test)^2)
rmse_kernel_test <- sqrt(mse_kernel_test)
n_kernel_test <- nrow(test)
r_squared_kernel_test <- 1 - (sum((test$Physical.Activity.Level - predicted_kernel_test)^2) / sum((test$Physical.Activity.Level - mean(test$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_kernel_test, "\n")
## Mean Squared Error (MSE) on the test set: 388.3439
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_kernel_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 19.70644
cat("R-squared (R^2) on the test set:", r_squared_kernel_test, "\n")
## R-squared (R^2) on the test set: 0.1725627
plot(test$Quality.of.Sleep, test$Physical.Activity.Level, main = "Kernel Regression (Test Set)", xlab = "Quality of Sleep", ylab = "Physical.Activity.Level")
lines(xseq, yhat, col = "blue")
points(test$Quality.of.Sleep, predicted_kernel_test, col = "red", pch = 20)

Based on the above evaluation metrics, this kernel regression model does not appear to provide a strong fit for the data. The relatively high MSE and RMSE values suggest that the model’s predictions have substantial errors. Moreover, the low R-squared values indicate that the model captures only a small portion of the relationship between the sleep quality and the level of physical activity. This suggests a weak or limited correlation between the variables.
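Since the same MSE, RMSE, and R-squared computations are repeated for every model and data split, they could be factored into a small helper. A sketch (`eval_metrics` is a hypothetical name, not a function used elsewhere in this report):

```r
# Compute MSE, RMSE, and R-squared for a vector of predictions.
eval_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)            # residual sum of squares
  tss <- sum((actual - mean(actual))^2)         # total sum of squares
  c(MSE  = sse / length(actual),
    RMSE = sqrt(sse / length(actual)),
    R2   = 1 - sse / tss)
}

# Example: small errors on a spread-out target give a high R-squared.
eval_metrics(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8))
```

Each evaluation block above could then be reduced to a single call such as `eval_metrics(valid$Daily.Steps, predicted_kernel)`.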

b. Regression Trees

Quality.of.Sleep v.s. Age

library(rpart)

tree_model <- rpart(Age ~ Quality.of.Sleep, data = train)
print(tree_model)
## n= 263 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 263 19877.240 42.18251  
##    2) Quality.of.Sleep< 8.5 213  9083.221 39.17840  
##      4) Quality.of.Sleep< 5.5 9   138.000 31.66667 *
##      5) Quality.of.Sleep>=5.5 204  8414.980 39.50980  
##       10) Quality.of.Sleep>=6.5 131  3647.221 38.66412 *
##       11) Quality.of.Sleep< 6.5 73  4505.945 41.02740 *
##    3) Quality.of.Sleep>=8.5 50   682.980 54.98000 *
plot(tree_model)
text(tree_model, use.n=TRUE, cex = 0.7)

predicted_tree_train <- predict(tree_model, newdata = train)

mse_tree_train <- mean((train$Age - predicted_tree_train)^2)
rmse_tree_train <- sqrt(mse_tree_train)
n_train <- nrow(train)
k_tree_train <- 1
r_squared_tree_train <- 1 - (sum((train$Age - predicted_tree_train)^2) / sum((train$Age - mean(train$Age))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_tree_train, "\n")
## Mean Squared Error (MSE) on the training set: 34.12223
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_tree_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 5.841424
cat("R-squared on the training set:", r_squared_tree_train, "\n")
## R-squared on the training set: 0.5485215
predicted_tree_valid <- predict(tree_model, newdata = valid)

mse_tree_valid <- mean((valid$Age - predicted_tree_valid)^2)
rmse_tree_valid <- sqrt(mse_tree_valid)
n_valid <- nrow(valid)
k_tree_valid <- 1 
r_squared_tree_valid <- 1 - (sum((valid$Age - predicted_tree_valid)^2) / sum((valid$Age - mean(valid$Age))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_tree_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 33.7074
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_tree_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 5.805807
cat("R-squared on the validation set:", r_squared_tree_valid, "\n")
## R-squared on the validation set: 0.5640556
predicted_tree_test <- predict(tree_model, newdata = test)

mse_tree_test <- mean((test$Age - predicted_tree_test)^2)
rmse_tree_test <- sqrt(mse_tree_test)
n_test <- nrow(test)
r_squared_tree_test <- 1 - (sum((test$Age - predicted_tree_test)^2) / sum((test$Age - mean(test$Age))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_tree_test, "\n")
## Mean Squared Error (MSE) on the test set: 39.06635
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_tree_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 6.250308
cat("R-squared on the test set:", r_squared_tree_test, "\n")
## R-squared on the test set: 0.4375695

The regression tree model for quality of sleep and age provides a moderate fit for the data. The relatively low MSE and RMSE values suggest that the model's predictions have fairly small errors, and the moderate R-squared values indicate that the model captures a meaningful portion of the relationship between the variables. However, the lower R-squared value on the test set (about 0.44) shows that its predictions on unseen data are less accurate. Overall, this suggests a moderate association between the quality of sleep and age.
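One way to guard against an over-grown tree is rpart's built-in cost-complexity pruning: grow the tree, inspect the cross-validated error in its cp table, and prune at the cp value that minimizes it. A sketch on synthetic step-shaped data (not the sleep dataset):

```r
library(rpart)

# Grow a tree on synthetic data with a step at x = 6.5, then prune at
# the complexity parameter with the lowest cross-validated error
# (the "xerror" column of the cp table).
set.seed(1)
df <- data.frame(x = runif(200, 4, 9))
df$y <- ifelse(df$x > 6.5, 8, 6) + rnorm(200, sd = 0.3)

fit <- rpart(y ~ x, data = df)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
```

Applying the same procedure to the trees in this section would be a natural extension, since the splits on Quality.of.Sleep are few and the risk of over-fitting is already being checked against the validation set.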

Quality.of.Sleep v.s. Sleep.Duration

tree_model <- rpart(Sleep.Duration ~ Quality.of.Sleep, data = train)
print(tree_model)
## n= 263 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 263 163.634400 7.125095  
##    2) Quality.of.Sleep< 6.5 82   4.152195 6.209756 *
##    3) Quality.of.Sleep>=6.5 181  59.653590 7.539779  
##      6) Quality.of.Sleep< 8.5 131  24.065500 7.271756  
##       12) Quality.of.Sleep< 7.5 54  15.928150 7.114815 *
##       13) Quality.of.Sleep>=7.5 77   5.874545 7.381818 *
##      7) Quality.of.Sleep>=8.5 50   1.521800 8.242000 *
plot(tree_model)
text(tree_model, use.n=TRUE, cex = 0.7)

predicted_tree_train <- predict(tree_model, newdata = train)

mse_tree_train <- mean((train$Sleep.Duration - predicted_tree_train)^2)
rmse_tree_train <- sqrt(mse_tree_train)
n_train <- nrow(train)
k_tree_train <- 1
r_squared_tree_train <- 1 - (sum((train$Sleep.Duration - predicted_tree_train)^2) / sum((train$Sleep.Duration - mean(train$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_tree_train, "\n")
## Mean Squared Error (MSE) on the training set: 0.1044741
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_tree_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 0.3232245
cat("R-squared on the training set:", r_squared_tree_train, "\n")
## R-squared on the training set: 0.8320849
predicted_tree_valid <- predict(tree_model, newdata = valid)

mse_tree_valid <- mean((valid$Sleep.Duration - predicted_tree_valid)^2)
rmse_tree_valid <- sqrt(mse_tree_valid)
n_valid <- nrow(valid)
k_tree_valid <- 1 
r_squared_tree_valid <- 1 - (sum((valid$Sleep.Duration - predicted_tree_valid)^2) / sum((valid$Sleep.Duration - mean(valid$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_tree_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 0.107131
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_tree_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 0.3273087
cat("R-squared on the validation set:", r_squared_tree_valid, "\n")
## R-squared on the validation set: 0.8354021
predicted_tree_test <- predict(tree_model, newdata = test)

mse_tree_test <- mean((test$Sleep.Duration - predicted_tree_test)^2)
rmse_tree_test <- sqrt(mse_tree_test)
n_test <- nrow(test)
r_squared_tree_test <- 1 - (sum((test$Sleep.Duration - predicted_tree_test)^2) / sum((test$Sleep.Duration - mean(test$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_tree_test, "\n")
## Mean Squared Error (MSE) on the test set: 0.09371114
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_tree_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 0.3061227
cat("R-squared on the test set:", r_squared_tree_test, "\n")
## R-squared on the test set: 0.8566609

This regression tree model has very low MSE and RMSE values, suggesting that its predictions have very small errors. The high R-squared values indicate that the model captures a large portion of the relationship between sleep quality and sleep duration, suggesting a fairly strong association between the variables.

Quality.of.Sleep v.s. Daily.Steps

tree_model <- rpart(Daily.Steps ~ Quality.of.Sleep, data = train)
print(tree_model)
## n= 263 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 263 719452500 6805.323  
##    2) Quality.of.Sleep< 5.5 9   3295556 3977.778 *
##    3) Quality.of.Sleep>=5.5 254 641652300 6905.512  
##      6) Quality.of.Sleep>=8.5 50  84503200 6144.000 *
##      7) Quality.of.Sleep< 8.5 204 521047500 7092.157  
##       14) Quality.of.Sleep< 7.5 127 461390200 6869.291  
##         28) Quality.of.Sleep>=6.5 54  96155000 6550.000 *
##         29) Quality.of.Sleep< 6.5 73 355657800 7105.479 *
##       15) Quality.of.Sleep>=7.5 77  42945190 7459.740 *
plot(tree_model)
text(tree_model, use.n=TRUE, cex = 0.7)

predicted_tree_train <- predict(tree_model, newdata = train)

mse_tree_train <- mean((train$Daily.Steps - predicted_tree_train)^2)
rmse_tree_train <- sqrt(mse_tree_train)
n_train <- nrow(train)
k_tree_train <- 1
r_squared_tree_train <- 1 - (sum((train$Daily.Steps - predicted_tree_train)^2) / sum((train$Daily.Steps - mean(train$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_tree_train, "\n")
## Mean Squared Error (MSE) on the training set: 2215045
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_tree_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 1488.303
cat("R-squared on the training set:", r_squared_tree_train, "\n")
## R-squared on the training set: 0.1902777
predicted_tree_valid <- predict(tree_model, newdata = valid)

mse_tree_valid <- mean((valid$Daily.Steps - predicted_tree_valid)^2)
rmse_tree_valid <- sqrt(mse_tree_valid)
n_valid <- nrow(valid)
k_tree_valid <- 1 
r_squared_tree_valid <- 1 - (sum((valid$Daily.Steps - predicted_tree_valid)^2) / sum((valid$Daily.Steps - mean(valid$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_tree_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 1769725
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_tree_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 1330.31
cat("R-squared on the validation set:", r_squared_tree_valid, "\n")
## R-squared on the validation set: 0.09499287
predicted_tree_test <- predict(tree_model, newdata = test)

mse_tree_test <- mean((test$Daily.Steps - predicted_tree_test)^2)
rmse_tree_test <- sqrt(mse_tree_test)
n_test <- nrow(test)
r_squared_tree_test <- 1 - (sum((test$Daily.Steps - predicted_tree_test)^2) / sum((test$Daily.Steps - mean(test$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_tree_test, "\n")
## Mean Squared Error (MSE) on the test set: 2016337
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_tree_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 1419.978
cat("R-squared on the test set:", r_squared_tree_test, "\n")
## R-squared on the test set: 0.2476354

Based on the above, it can be inferred that this regression tree model does not provide a strong fit for the data. The relatively high MSE and RMSE values indicate that the model’s predictions have a large spread of errors. The low R-squared values suggest that the model explains only a small proportion of the variance in the daily steps variable, indicating a weak correlation between the quality of sleep and the number of daily steps.

Quality.of.Sleep v.s. Physical.Activity.Level

tree_model <- rpart(Physical.Activity.Level ~ Quality.of.Sleep, data = train)
print(tree_model)
## n= 263 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 263 110565.6000 59.07224  
##    2) Quality.of.Sleep< 5.5 9    122.2222 35.55556 *
##    3) Quality.of.Sleep>=5.5 254 105289.7000 59.90551  
##      6) Quality.of.Sleep< 7.5 127  55908.0900 56.38583 *
##      7) Quality.of.Sleep>=7.5 127  46235.0400 63.42520  
##       14) Quality.of.Sleep>=8.5 50  27264.5000 56.10000 *
##       15) Quality.of.Sleep< 8.5 77  14545.4500 68.18182 *
plot(tree_model)
text(tree_model, use.n=TRUE, cex = 0.7)

predicted_tree_train <- predict(tree_model, newdata = train)

mse_tree_train <- mean((train$Physical.Activity.Level - predicted_tree_train)^2)
rmse_tree_train <- sqrt(mse_tree_train)
n_train <- nrow(train)
k_tree_train <- 1
r_squared_tree_train <- 1 - (sum((train$Physical.Activity.Level - predicted_tree_train)^2) / sum((train$Physical.Activity.Level - mean(train$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_tree_train, "\n")
## Mean Squared Error (MSE) on the training set: 372.0162
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_tree_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 19.28772
cat("R-squared on the training set:", r_squared_tree_train, "\n")
## R-squared on the training set: 0.1150932
predicted_tree_valid <- predict(tree_model, newdata = valid)

mse_tree_valid <- mean((valid$Physical.Activity.Level - predicted_tree_valid)^2)
rmse_tree_valid <- sqrt(mse_tree_valid)
n_valid <- nrow(valid)
k_tree_valid <- 1 
r_squared_tree_valid <- 1 - (sum((valid$Physical.Activity.Level - predicted_tree_valid)^2) / sum((valid$Physical.Activity.Level - mean(valid$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_tree_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 394.6096
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_tree_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 19.86478
cat("R-squared on the validation set:", r_squared_tree_valid, "\n")
## R-squared on the validation set: 0.1326998
predicted_tree_test <- predict(tree_model, newdata = test)

mse_tree_test <- mean((test$Physical.Activity.Level - predicted_tree_test)^2)
rmse_tree_test <- sqrt(mse_tree_test)
n_test <- nrow(test)
r_squared_tree_test <- 1 - (sum((test$Physical.Activity.Level - predicted_tree_test)^2) / sum((test$Physical.Activity.Level - mean(test$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_tree_test, "\n")
## Mean Squared Error (MSE) on the test set: 387.7837
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_tree_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 19.69222
cat("R-squared on the test set:", r_squared_tree_test, "\n")
## R-squared on the test set: 0.1737563

The regression tree model does not provide a strong fit for the data. The relatively high MSE and RMSE values indicate that the model’s predictions have a large spread of errors. The low R-squared values suggest that the model explains only a small proportion of the variance in the physical activity level variable, indicating a weak correlation.

c. Locally Weighted Regression

Quality.of.Sleep v.s. Age

library(ggplot2)

loess_model <- loess(Age ~ Quality.of.Sleep, data = train, span = 0.8) 
predicted_loess_train <- predict(loess_model, newdata = train)

mse_loess_train <- mean((train$Age - predicted_loess_train)^2)
rmse_loess_train <- sqrt(mse_loess_train)
r_squared_loess_train <- 1 - (sum((train$Age - predicted_loess_train)^2) / sum((train$Age - mean(train$Age))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_loess_train, "\n")
## Mean Squared Error (MSE) on the training set: 34.2713
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_loess_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 5.854169
cat("R-squared on the training set:", r_squared_loess_train, "\n")
## R-squared on the training set: 0.5465492
ggplot(train, aes(x = Quality.of.Sleep, y = Age)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "red") +
  labs(title = "LOESS Regression (Train Set)", x = "Quality of Sleep", y = "Age")
## `geom_smooth()` using formula = 'y ~ x'

Test against Validation Set

predicted_loess_valid <- predict(loess_model, newdata = valid)

mse_loess_valid <- mean((valid$Age - predicted_loess_valid)^2)
rmse_loess_valid <- sqrt(mse_loess_valid)
r_squared_loess_valid <- 1 - (sum((valid$Age - predicted_loess_valid)^2) / sum((valid$Age - mean(valid$Age))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_loess_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 33.98962
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_loess_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 5.830062
cat("R-squared on the validation set:", r_squared_loess_valid, "\n")
## R-squared on the validation set: 0.5604056
predicted_data <- data.frame(Quality.of.Sleep = valid$Quality.of.Sleep, Age = predicted_loess_valid)

ggplot(valid, aes(x = Quality.of.Sleep, y = Age)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data, aes(x = Quality.of.Sleep, y = Age), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Validation Set)", x = "Quality of Sleep", y = "Age")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 5.985
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 9.8049e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 5.985
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 9.8049e-17
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1

Test against Test Set

predicted_loess_test <- predict(loess_model, newdata = test)
mse_loess_test <- mean((test$Age - predicted_loess_test)^2)
rmse_loess_test <- sqrt(mse_loess_test)
r_squared_loess_test <- 1 - (sum((test$Age - predicted_loess_test)^2) / sum((test$Age - mean(test$Age))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_loess_test, "\n")
## Mean Squared Error (MSE) on the test set: 41.93021
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_loess_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 6.475354
cat("R-squared on the test set:", r_squared_loess_test, "\n")
## R-squared on the test set: 0.3963391
predicted_data_test <- data.frame(Quality.of.Sleep = test$Quality.of.Sleep, Age = predicted_loess_test)

ggplot(test, aes(x = Quality.of.Sleep, y = Age)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data_test, aes(x = Quality.of.Sleep, y = Age), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Test Set)", x = "Quality of Sleep", y = "Age")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 6
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 5.5564e-17
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 6
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 5.5564e-17

This locally weighted regression model for the quality of sleep and age predicts reasonably well. The relatively low MSE and RMSE values suggest that the model's errors are small. However, the R-squared values are only moderate (about 0.40 on the test set), indicating a moderate association and a reasonable, but not strong, fit between the variables.
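The span argument used above (0.8) controls how local the fits are: it is the fraction of the data used in each local regression, so smaller spans track the data more closely at the risk of over-fitting. A sketch on synthetic data (not the sleep dataset):

```r
# Compare a smoother and a more flexible LOESS fit on the same data.
set.seed(7)
x <- seq(1, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)
d <- data.frame(x, y)

fit_smooth   <- loess(y ~ x, data = d, span = 0.8)  # wider neighborhoods
fit_flexible <- loess(y ~ x, data = d, span = 0.3)  # narrower neighborhoods

# The more flexible fit typically achieves a lower in-sample residual
# sum of squares, but may generalize worse to new data.
rss <- function(m) sum(residuals(m)^2)
c(smooth = rss(fit_smooth), flexible = rss(fit_flexible))
```

This is the same trade-off being measured in this report by comparing training, validation, and test metrics.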

Quality.of.Sleep v.s. Sleep.Duration

loess_model <- loess(Sleep.Duration ~ Quality.of.Sleep, data = train, span = 0.8) 
predicted_loess_train <- predict(loess_model, newdata = train)

mse_loess_train <- mean((train$Sleep.Duration - predicted_loess_train)^2)
rmse_loess_train <- sqrt(mse_loess_train)
r_squared_loess_train <- 1 - (sum((train$Sleep.Duration - predicted_loess_train)^2) / sum((train$Sleep.Duration - mean(train$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_loess_train, "\n")
## Mean Squared Error (MSE) on the training set: 0.1087308
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_loess_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 0.3297435
cat("R-squared on the training set:", r_squared_loess_train, "\n")
## R-squared on the training set: 0.8252433
ggplot(train, aes(x = Quality.of.Sleep, y = Sleep.Duration)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "red") +
  labs(title = "LOESS Regression (Train Set)", x = "Quality of Sleep", y = "Sleep.Duration")
## `geom_smooth()` using formula = 'y ~ x'

Test against Validation Set

predicted_loess_valid <- predict(loess_model, newdata = valid)

mse_loess_valid <- mean((valid$Sleep.Duration - predicted_loess_valid)^2)
rmse_loess_valid <- sqrt(mse_loess_valid)
r_squared_loess_valid <- 1 - (sum((valid$Sleep.Duration - predicted_loess_valid)^2) / sum((valid$Sleep.Duration - mean(valid$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_loess_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 0.10713
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_loess_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 0.3273072
cat("R-squared on the validation set:", r_squared_loess_valid, "\n")
## R-squared on the validation set: 0.8354036
predicted_data <- data.frame(Quality.of.Sleep = valid$Quality.of.Sleep, Sleep.Duration = predicted_loess_valid)

ggplot(valid, aes(x = Quality.of.Sleep, y = Sleep.Duration)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data, aes(x = Quality.of.Sleep, y = Sleep.Duration), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Validation Set)", x = "Quality of Sleep", y = "Sleep.Duration")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 5.985
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2.015
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 9.8049e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 1
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 5.985
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.015
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 9.8049e-17
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 1

Test against Test Set

predicted_loess_test <- predict(loess_model, newdata = test)
mse_loess_test <- mean((test$Sleep.Duration - predicted_loess_test)^2)
rmse_loess_test <- sqrt(mse_loess_test)
r_squared_loess_test <- 1 - (sum((test$Sleep.Duration - predicted_loess_test)^2) / sum((test$Sleep.Duration - mean(test$Sleep.Duration))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_loess_test, "\n")
## Mean Squared Error (MSE) on the test set: 0.1039611
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_loess_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 0.3224301
cat("R-squared on the test set:", r_squared_loess_test, "\n")
## R-squared on the test set: 0.8409826
predicted_data_test <- data.frame(Quality.of.Sleep = test$Quality.of.Sleep, Sleep.Duration = predicted_loess_test)

ggplot(test, aes(x = Quality.of.Sleep, y = Sleep.Duration)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data_test, aes(x = Quality.of.Sleep, y = Sleep.Duration), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Test Set)", x = "Quality of Sleep", y = "Sleep.Duration")
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 6
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 2
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 5.5564e-17
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 6
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 2
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 5.5564e-17

The locally weighted regression model performs well for quality of sleep versus sleep duration. The low MSE and RMSE values indicate accurate predictions with small errors, and the high R-squared values (about 0.84 on both the validation and test sets) demonstrate a strong fit between quality of sleep and sleep duration. The results suggest a strong positive relationship with high predictive power.
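Since the same MSE, RMSE, and R-squared computation is repeated for every split, it could be factored into a small helper function. This is only a sketch; the name `eval_metrics` is introduced here and is not part of the report's code.

```r
# Hypothetical helper: compute MSE, RMSE, and R-squared for a vector of
# predictions against the observed values.
eval_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  mse <- mean(resid^2)
  list(
    mse = mse,
    rmse = sqrt(mse),
    r_squared = 1 - sum(resid^2) / sum((actual - mean(actual))^2)
  )
}
```

It would then replace each repeated block, e.g. `eval_metrics(valid$Sleep.Duration, predicted_loess_valid)`.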

Quality.of.Sleep vs. Daily.Steps

loess_model <- loess(Daily.Steps ~ Quality.of.Sleep, data = train, span = 0.8)  # Adjust the span as needed
predicted_loess_train <- predict(loess_model, newdata = train)

mse_loess_train <- mean((train$Daily.Steps - predicted_loess_train)^2)
rmse_loess_train <- sqrt(mse_loess_train)
r_squared_loess_train <- 1 - (sum((train$Daily.Steps - predicted_loess_train)^2) / sum((train$Daily.Steps - mean(train$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_loess_train, "\n")
## Mean Squared Error (MSE) on the training set: 2245511
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_loess_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 1498.503
cat("R-squared on the training set:", r_squared_loess_train, "\n")
## R-squared on the training set: 0.1791405
ggplot(train, aes(x = Quality.of.Sleep, y = Daily.Steps)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "red") +
  labs(title = "LOESS Regression (Train Set)", x = "Quality of Sleep", y = "Daily.Steps")
## `geom_smooth()` using formula = 'y ~ x'

Test against Validation Set

predicted_loess_valid <- predict(loess_model, newdata = valid)

mse_loess_valid <- mean((valid$Daily.Steps - predicted_loess_valid)^2)
rmse_loess_valid <- sqrt(mse_loess_valid)
r_squared_loess_valid <- 1 - (sum((valid$Daily.Steps - predicted_loess_valid)^2) / sum((valid$Daily.Steps - mean(valid$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_loess_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 1769725
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_loess_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 1330.31
cat("R-squared on the validation set:", r_squared_loess_valid, "\n")
## R-squared on the validation set: 0.09499287
predicted_data <- data.frame(Quality.of.Sleep = valid$Quality.of.Sleep, Daily.Steps = predicted_loess_valid)

ggplot(valid, aes(x = Quality.of.Sleep, y = Daily.Steps)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data, aes(x = Quality.of.Sleep, y = Daily.Steps), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Validation Set)", x = "Quality of Sleep", y = "Daily.Steps")
## `geom_smooth()` using formula = 'y ~ x'

Test against Test Set

predicted_loess_test <- predict(loess_model, newdata = test)
mse_loess_test <- mean((test$Daily.Steps - predicted_loess_test)^2)
rmse_loess_test <- sqrt(mse_loess_test)
r_squared_loess_test <- 1 - (sum((test$Daily.Steps - predicted_loess_test)^2) / sum((test$Daily.Steps - mean(test$Daily.Steps))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_loess_test, "\n")
## Mean Squared Error (MSE) on the test set: 2389363
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_loess_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 1545.756
cat("R-squared on the test set:", r_squared_loess_test, "\n")
## R-squared on the test set: 0.1084468
predicted_data_test <- data.frame(Quality.of.Sleep = test$Quality.of.Sleep, Daily.Steps = predicted_loess_test)

ggplot(test, aes(x = Quality.of.Sleep, y = Daily.Steps)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data_test, aes(x = Quality.of.Sleep, y = Daily.Steps), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Test Set)", x = "Quality of Sleep", y = "Daily.Steps")
## `geom_smooth()` using formula = 'y ~ x'

For this locally weighted regression model, the high MSE and RMSE values suggest low accuracy and large prediction errors. Moreover, the low R-squared values (roughly 0.10 to 0.18 across the three splits) mean that the model explains only a small proportion of the variance in daily steps. Thus the model has a weak fit, and the association between the two variables is small.

Quality.of.Sleep vs. Physical.Activity.Level

loess_model <- loess(Physical.Activity.Level ~ Quality.of.Sleep, data = train, span = 0.8)
predicted_loess_train <- predict(loess_model, newdata = train)

mse_loess_train <- mean((train$Physical.Activity.Level - predicted_loess_train)^2)
rmse_loess_train <- sqrt(mse_loess_train)
r_squared_loess_train <- 1 - (sum((train$Physical.Activity.Level - predicted_loess_train)^2) / sum((train$Physical.Activity.Level - mean(train$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the training set:", mse_loess_train, "\n")
## Mean Squared Error (MSE) on the training set: 372.3974
cat("Root Mean Squared Error (RMSE) on the training set:", rmse_loess_train, "\n")
## Root Mean Squared Error (RMSE) on the training set: 19.2976
cat("R-squared on the training set:", r_squared_loess_train, "\n")
## R-squared on the training set: 0.1141866
ggplot(train, aes(x = Quality.of.Sleep, y = Physical.Activity.Level)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "red") +
  labs(title = "LOESS Regression (Train Set)", x = "Quality of Sleep", y = "Physical.Activity.Level")
## `geom_smooth()` using formula = 'y ~ x'

Test against Validation Set

predicted_loess_valid <- predict(loess_model, newdata = valid)

mse_loess_valid <- mean((valid$Physical.Activity.Level - predicted_loess_valid)^2)
rmse_loess_valid <- sqrt(mse_loess_valid)
r_squared_loess_valid <- 1 - (sum((valid$Physical.Activity.Level - predicted_loess_valid)^2) / sum((valid$Physical.Activity.Level - mean(valid$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the validation set:", mse_loess_valid, "\n")
## Mean Squared Error (MSE) on the validation set: 389.5194
cat("Root Mean Squared Error (RMSE) on the validation set:", rmse_loess_valid, "\n")
## Root Mean Squared Error (RMSE) on the validation set: 19.73625
cat("R-squared on the validation set:", r_squared_loess_valid, "\n")
## R-squared on the validation set: 0.1438874
predicted_data <- data.frame(Quality.of.Sleep = valid$Quality.of.Sleep, Physical.Activity.Level = predicted_loess_valid)

ggplot(valid, aes(x = Quality.of.Sleep, y = Physical.Activity.Level)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data, aes(x = Quality.of.Sleep, y = Physical.Activity.Level), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Validation Set)", x = "Quality of Sleep", y = "Physical.Activity.Level")
## `geom_smooth()` using formula = 'y ~ x'

Test against Test Set

predicted_loess_test <- predict(loess_model, newdata = test)
mse_loess_test <- mean((test$Physical.Activity.Level - predicted_loess_test)^2)
rmse_loess_test <- sqrt(mse_loess_test)
r_squared_loess_test <- 1 - (sum((test$Physical.Activity.Level - predicted_loess_test)^2) / sum((test$Physical.Activity.Level - mean(test$Physical.Activity.Level))^2))

cat("Mean Squared Error (MSE) on the test set:", mse_loess_test, "\n")
## Mean Squared Error (MSE) on the test set: 388.3364
cat("Root Mean Squared Error (RMSE) on the test set:", rmse_loess_test, "\n")
## Root Mean Squared Error (RMSE) on the test set: 19.70625
cat("R-squared on the test set:", r_squared_loess_test, "\n")
## R-squared on the test set: 0.1725787
predicted_data_test <- data.frame(Quality.of.Sleep = test$Quality.of.Sleep, Physical.Activity.Level = predicted_loess_test)

ggplot(test, aes(x = Quality.of.Sleep, y = Physical.Activity.Level)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.8, col = "blue") +
  geom_point(data = predicted_data_test, aes(x = Quality.of.Sleep, y = Physical.Activity.Level), col = "red", pch = 20) +
  labs(title = "LOESS Regression (Test Set)", x = "Quality of Sleep", y = "Physical.Activity.Level")
## `geom_smooth()` using formula = 'y ~ x'

From the above analysis, we can see that the MSE and RMSE values are quite high, so the locally weighted regression model gives less accurate predictions with larger errors. Furthermore, the R-squared values are low (about 0.11 to 0.17 across the splits), indicating that the model explains only a small proportion of the variance in physical activity level. Thus this model fits the data weakly, with little association between sleep quality and physical activity level.

Conclusion

By doing this assignment, I learned that the dataset I analyzed is better suited to nonparametric regression models, and the reason lies in the structure of the data. Quality of sleep is discrete, taking only integer values from 4 to 9, so for parametric regression models such as the polynomial model the fitted curve looks jagged instead of smooth. Judging by the evaluation metrics, the values from the nonparametric models are mostly better than those from the parametric models.
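One way to see the effect of the discrete predictor: with only six distinct quality-of-sleep levels, any smoother essentially tracks the mean response at each level. A sketch of that per-level baseline (the helper name `level_means` is mine, not from the report):

```r
# Sketch: per-level mean of a response over a discrete predictor. With only a
# few distinct predictor values, a fitted curve passes near these group means.
level_means <- function(df, formula) {
  aggregate(formula, data = df, FUN = mean)
}
# e.g. level_means(train, Sleep.Duration ~ Quality.of.Sleep)
```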

The variables I decided to compare with quality of sleep are age, sleep duration, daily steps, and physical activity level. Interestingly, the results show that sleep duration has the strongest positive correlation with sleep quality: it yields high R-squared values across many of the regression models, showing that those models can accurately predict new, unseen data.
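The correlation claim can be checked directly: for a single predictor, the squared Pearson correlation equals the R-squared of a simple linear fit, so it gives a quick ceiling on the variance a linear model can explain. A sketch (the helper name `cor_sq` is introduced here):

```r
# Squared Pearson correlation between two columns of a data frame; for one
# predictor this equals the R-squared of the simple linear regression.
cor_sq <- function(df, x, y) cor(df[[x]], df[[y]])^2
# e.g. cor_sq(train, "Quality.of.Sleep", "Sleep.Duration")
```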

Possible Problems To Investigate For Future Studies

  1. If I adjust the proportions of the training, validation, and test sets, would the results be different?

  2. Are there any other variables that could influence the quality of sleep?

  3. Are there any other variables that could influence the factors of quality of sleep (e.g., sleep duration)?
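For question 1, the split proportions could be varied with a small helper and the models refit on each split. A sketch (the function name and default proportions are mine, not from the report):

```r
# Hypothetical helper: shuffle the rows of a data frame and split them into
# train / validation / test sets with the given proportions.
split_data <- function(df, p_train = 0.6, p_valid = 0.2, seed = 42) {
  set.seed(seed)  # for a reproducible shuffle
  n <- nrow(df)
  idx <- sample(seq_len(n))
  n_train <- floor(p_train * n)
  n_valid <- floor(p_valid * n)
  list(
    train = df[idx[seq_len(n_train)], , drop = FALSE],
    valid = df[idx[n_train + seq_len(n_valid)], , drop = FALSE],
    test  = df[idx[(n_train + n_valid + 1):n], , drop = FALSE]
  )
}
# e.g. splits <- split_data(data, p_train = 0.5, p_valid = 0.25)
```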